- Key concepts and terminology
- General workflow for creating predictive models
Practical Machine Learning with caret
January 31, 2019
now online on https://wzbsocialsciencecenter.github.io/wzb_r_tutorial/
source: xkcd.com/1838/
Machine learning (ML) in general
supervised ML
unsupervised ML
Although it has only been popular in the media for about a decade, the term Machine Learning has been around since the late 1950s, when "Artificial Intelligence" had its first heyday. Most methods come from the fields of statistics, probability theory and computer science; some of them are several decades old.
Back then, there were two obstacles to applying these methods at scale: too little data and too slow computers. Researchers hence focused on "expert systems" (rule-based methods).
Since the advent of the Internet and cheaper, more powerful computers, this has changed: ML is back as a highly active research field and an established method applied in many industries.
Some terminology used in these slides and popular synonyms:
Predictive models are supervised ML models built primarily to make the most accurate possible predictions from input data.
Some predictive models are more complex than "traditional" models such as Ordinary Least Squares (OLS) regression, logistic regression, etc., which can make them hard to interpret ("black box models"). This is accepted as a trade-off for higher predictive accuracy.
Example:
Given a set of features about a certain neighborhood in a town, a model should answer as accurately as possible: Is it worth campaigning here in order to win voters for an election? It would take into account known features of the neighborhood (previous election results, social structure, etc.) and campaigning costs. Why a neighborhood is chosen for campaigning or not is only of secondary interest (e.g. in order to understand why predictions fail).
There is always a tension between predictability and interpretability:
While the primary interest of predictive modeling is to generate accurate predictions, a secondary interest may be to interpret the model and understand why it works. The unfortunate reality is that as we push towards higher accuracy, models become more complex and their interpretability becomes more difficult. This is almost always the trade-off we make when prediction accuracy is the primary goal.
– Kuhn & Johnson 2016
What do you think about that? Are "black box models" a danger? Or can they be justified?
Note that there is currently a lot of research about "interpretable ML models", which tries to reduce the tension between highly accurate predictability and interpretability.
One recent approach is LIME (Ribeiro et al. 2016), which we can have a look at in the next session.
Besides potentially higher predictive accuracy, many models have beneficial properties such as:
Which of these properties apply depends on the actual model you use.
For some research questions, high prediction accuracy might be of interest (e.g. forecasting, classification).
Other research questions might focus on whether or how accurately a prediction can be made, rather than on what influences those predictions (see Taylor Brown's project about the art market).
Predictive ML models might be used as a tool for parts of your research (e.g. using their strong predictive power for imputation or classification).
Even if you don't intend to use these models in your research, studying them might still be interesting, because they are widely employed in many sectors. To name a few:
For a more comprehensive list and the (often unsettling) implications see Weapons of Math Destruction (O'Neil 2016).
Kuhn & Johnson 2016 identify four main reasons why predictive models fail:
They propose a workflow for building predictive models that avoids these pitfalls.
→ produces an honest estimate of model performance because the test data was not part of the training process
There are several strategies to split the data (depends on data and aim of the model):
How much data is used for the test set?
→ find right balance, because:
Usually 10 to 30% of the sample is used as test data. If only little data is available, rely on resampling instead of a separate test data set.
Data pre-processing transforms, adds or removes variables in the training set data.¹ This process is also called feature engineering.
Depending on the model type, it is important which and how the predictors enter the model, e.g.:
Note that some data transformation techniques (e.g. skewness transformations, PCA) reduce model interpretability.
¹ The final pre-processing steps must also be applied to the test data and final input data in production for proper prediction. However, keep in mind that the test data is left untouched until final model validation.
Common techniques for individual predictors include:
Removing predictors prior to modeling has several advantages:
Common techniques include:
Sometimes, it is beneficial to replace one or multiple predictors, e.g.:
This may improve model performance and/or interpretability.
Model tuning is the process of finding optimal model parameters and hyperparameters that both fit the training data well and generalize well to unseen data.
Data splitting, resampling and performance metrics are used to evaluate the model fit and generalizability.
A more detailed look at the model tuning process:
A model learnt from the training data set may be able to perfectly "re-predict" the correct outcomes for that same data.
Example: A model trained on a labelled set of emails could perfectly classify emails as spam / not spam if it used the same (but unlabelled) emails as input again.
However, this model also learnt the noise in the data set! It would perform badly on unseen data. This model is said to be over-fitting.
source: Kuhn & Johnson 2016
Many models have hyperparameters. These parameters allow for greater flexibility of the model but cannot be determined analytically.
In order to find an optimal set of hyperparameters, we can create a sequence of hyperparameter candidates \(H\). Then for each set of hyperparameters \(H_i\):
Suppose we have the following data with outcome y and single predictor x:
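The simulation code for this data is not shown; a minimal sketch producing data of a similar shape (the sample size, noise level and seed are assumptions) could be:

set.seed(1)                                            # assumed seed for reproducibility
n <- 100                                               # assumed sample size
sine_data <- data.frame(x = runif(n, 0, 4 * pi))
sine_data$y <- sin(sine_data$x) + rnorm(n, sd = 0.3)   # sine signal plus Gaussian noise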
We want to fit a LOESS model (a local regression model) to the data, which has a single hyperparameter, span. It determines the proportion of neighborhood points used to calculate each local regression fit (→ the "smoothness" of the fit).
A model function in R usually provides at least one of the following two "interfaces" to specify a model:
The formula interface:
- specify the outcome and predictors via a formula (e.g. y ~ x1 + x2) and the data set
- allows interactions (e.g. y ~ gender:income) and transformations
- y ~ . uses all remaining variables as predictors
- example: lm(Hwt ~ Sex:Bwt, data = cats) – linear model for a cat's heart weight as function of body weight per gender (using the cats data set from MASS)

The matrix interface:
# create numeric matrix with dummy variable from data frame first:
library(dplyr)        # for mutate() and select()
library(elasticnet)   # provides enet()
catsX <- as.matrix(mutate(cats, female = Sex == 'F') %>% select(-Hwt, -Sex))
enet(x = catsX, y = cats$Hwt, lambda = 0.001)
An example model using the whole cats data:
library(MASS)   # for cats data
catsmodel <- lm(Hwt ~ Sex:Bwt, data = cats)
summary(catsmodel)
##
## Call:
## lm(formula = Hwt ~ Sex:Bwt, data = cats)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.5686 -0.9631 -0.0937  1.0423  5.1252
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.3465     0.8354  -0.415    0.679
## SexF:Bwt      4.0284     0.3607  11.168   <2e-16 ***
## SexM:Bwt      4.0311     0.2853  14.129   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.458 on 141 degrees of freedom
## Multiple R-squared:  0.6466, Adjusted R-squared:  0.6416
## F-statistic:   129 on 2 and 141 DF,  p-value: < 2.2e-16
Create a data frame with new data:
(newdata <- data.frame(Sex = c('F', 'M'), Bwt = c(2.3, 3.2)))
##   Sex Bwt
## 1   F 2.3
## 2   M 3.2
Predict the response (heart weight in grams) from it:
predict(catsmodel, newdata)
##         1         2
##  8.918796 12.553010
Predictions seem reasonable (of course, this is not a proper model validation):
group_by(cats, Sex) %>% summarize(mean(Bwt), mean(Hwt))
## # A tibble: 2 x 3
##   Sex   `mean(Bwt)` `mean(Hwt)`
##   <fct>       <dbl>       <dbl>
## 1 F            2.36        9.20
## 2 M            2.9        11.3
Back to our simulated data:
Suppose sine_data was our training data. We hold out a random 25% of the data:
held_out_indices <- sample(1:nrow(sine_data), size = round(nrow(sine_data) * 0.25))
data_held_out <- sine_data[held_out_indices,]   # hold out the 25% for testing
data_train <- sine_data[-held_out_indices,]     # take the other 75% for training
We now try different values for span: we fit a model to data_train and let it predict values from the hold-out data set. Let's start with span = 0.1:
loess.1_model <- loess(y ~ x, data = data_train, span = 0.1)

# predict y from held-out x:
loess.1_pred <- predict(loess.1_model, data_held_out['x'])
We can calculate a performance measure (root mean squared error – RMSE) using the predicted values loess.1_pred and the true values from the held-out data set:
rmse <- function(pred, obs) {
  sqrt(mean((pred - obs)^2))
}
rmse(loess.1_pred, data_held_out$y)
## [1] 0.2727023
We repeat the same as above for more hyperparameter values, here span = 0.5:
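The corresponding fit is not shown above; analogous to the span = 0.1 chunk, it would look like this (object names assumed to follow the same scheme):

loess.5_model <- loess(y ~ x, data = data_train, span = 0.5)
loess.5_pred <- predict(loess.5_model, data_held_out['x'])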
rmse(loess.5_pred, data_held_out$y)
## [1] 0.2630971
And span = 0.9:
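Again, the fit itself follows the same pattern:

loess.9_model <- loess(y ~ x, data = data_train, span = 0.9)
loess.9_pred <- predict(loess.9_model, data_held_out['x'])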
rmse(loess.9_pred, data_held_out$y)
## [1] 0.2755069
We see that for this single resampling iteration span = 0.5 yields the best performance (lowest RMSE). To have more confidence, we would repeat the resampling, model fitting and validation several times.
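A minimal sketch of such a repetition, assuming the objects and the rmse() helper defined above (the number of repetitions and the candidate values are arbitrary choices here):

span_candidates <- c(0.1, 0.5, 0.9)
n_repeats <- 25   # arbitrary number of resampling repetitions for this sketch

mean_rmse <- sapply(span_candidates, function(s) {
  rmse_per_rep <- replicate(n_repeats, {
    # new random 25% hold-out split in each repetition
    idx <- sample(1:nrow(sine_data), size = round(nrow(sine_data) * 0.25))
    held_out <- sine_data[idx, ]
    train_part <- sine_data[-idx, ]
    m <- loess(y ~ x, data = train_part, span = s)
    rmse(predict(m, held_out['x']), held_out$y)
  })
  # NAs can occur when held-out x values fall outside the training range
  mean(rmse_per_rep, na.rm = TRUE)
})
setNames(mean_rmse, span_candidates)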
We also see visually that span = 0.5 provides the best model fit:
Diagnostic plots for model with span = 0.5:
In order to get accurate performance measurements and their standard errors / CIs, we need to repeat the resampling and model fitting. There are several common techniques for that (see the sketch after this list):
(repeated) k-fold cross validation
bootstrapping
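Both schemes can be generated with helper functions from the caret package (covered below); a minimal sketch, with fold and resample counts chosen arbitrarily:

library(caret)

# 10-fold CV: returns a list with the held-out row indices of each fold
cv_folds <- createFolds(sine_data$y, k = 10, returnTrain = FALSE)

# bootstrapping: 25 resamples drawn with replacement (training row indices)
boot_samples <- createResample(sine_data$y, times = 25)

str(cv_folds[1:3])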
Which measures of performance to calculate depends on the prediction problem. Common measures include:
Cost functions may also account for domain-specific adjustments, e.g. a model for political campaigning may take travelling distances into account (by favoring a model with a lower travelling distance over one with a higher travelling distance, even if the latter would produce a better voter outcome).
After repeated model fitting you get a performance estimate for each hyperparameter set. Again, there are several strategies for selecting the final set of hyperparameters:
The model tuning process may be very computationally intensive. This largely depends on:
Example:
It may take hours to days to compute. In order to speed up computations, you should use parallel processing, which runs the computations on several CPU cores simultaneously. Modern laptops have 4 to 8 CPU cores. For larger projects, consider using a dedicated cluster machine (e.g. theia at WZB has 64 CPU cores).
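One common way to enable this for caret in R (a sketch assuming the doParallel package is installed; the number of workers is just an example):

library(doParallel)   # also attaches the base "parallel" package

cl <- makePSOCKcluster(4)   # start 4 worker processes (adjust to your machine)
registerDoParallel(cl)      # train() will now run resampling iterations in parallel

# ... run train(...) as usual here ...

stopCluster(cl)             # release the workers when done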
The caret package

"The caret package (short for Classification and Regression Training) is a set of functions that attempt to streamline the process for creating predictive models." – caret documentation manual
→ provides a set of tools for all the steps outlined in the previous slides together with a unified interface to numerous ML / statistical modeling algorithms
Housing Values in Suburbs of Boston: data set Boston from package MASS.

- medv: "median value of owner-occupied homes in $1000s"
- crim: per capita crime rate
- nox: nitrogen oxides concentration
- rm: average number of rooms per dwelling
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- lstat: lower status of the population (percent)

See ?Boston for the full list of predictors.

Our goal: Predict the median value of a home (response variable medv) from the given predictors.
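Before modeling, a quick first look at the data never hurts (a minimal sketch):

library(MASS)        # provides the Boston data set
str(Boston)          # 506 observations of 14 variables
summary(Boston$medv)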
We load the required packages and divide the original data set into training and test data sets:
library(MASS)   # for Boston data set
library(dplyr)  # for data transformations
library(caret)

# use ~75% of the data for training; get randomly sampled row indices
in_train_indices <- createDataPartition(Boston$medv, p = 0.75, list = FALSE)
boston_train <- Boston[in_train_indices,]
boston_test <- Boston[-in_train_indices,]
Until final model validation, we'll only work with boston_train!
One of the first steps could be to identify skew and outliers in the predictors' distributions. Histograms can be used for that:
# predictor `dis` (weighted mean of distances to five Boston employment centres)
qplot(boston_train$dis, bins = 30)
We can apply a logarithmic transformation to reduce the skew:
boston_train$dis_trans <- log(boston_train$dis)
boston_train$dis <- NULL
qplot(boston_train$dis_trans, bins = 20)
It's also helpful to identify skewed distributions via the sample skewness statistic (Joanes & Gill 1998):
library(e1071)
apply(boston_train, 2, skewness)
##       crim         zn      indus       chas        nox         rm
##  5.1630181  2.2216076  0.2702900  3.1158660  0.7304972  0.3507523
##        age        rad        tax    ptratio      black      lstat
## -0.6512674  0.9688603  0.6532567 -0.7535644 -2.8878938  0.9179482
##       medv  dis_trans
##  1.1047772  0.1611842
A correlation plot reveals pairwise highly correlated variables:
library(corrplot)
correl <- cor(boston_train)
corrplot(correl, order = 'hclust')
Kuhn & Johnson 2016 suggest a "heuristic approach […] to remove the minimum number of predictors to ensure that all pairwise correlations are below a certain threshold." It is implemented in findCorrelation:
corr_var_indices <- findCorrelation(correl, cutoff = 0.9)
names(boston_train)[corr_var_indices]
## [1] "tax"
Using a cutoff correlation value of 0.9, this only identified tax as worth removing from the data set:
boston_train <- boston_train[, -corr_var_indices]
As an example, we'll use ridge regression (Hoerl & Kennard 1970) to build a predictive model from our training data. Ridge regression is an extension of OLS regression that adds a penalty on large coefficients. The hyperparameter \(\lambda \geq 0\) controls the strength of the penalty.
We can use model tuning on the training data in order to find a \(\lambda\) that corresponds with the lowest RMSE.
We want to evaluate different values \(0, 0.01, \ldots, 0.1\) for \(\lambda\). We create a data frame for this with a single column .lambda (the . before the parameter name is necessary by convention):
(ridge_hyperparam_sets <- data.frame(.lambda = seq(0, 0.1, by = 0.01)))
##    .lambda
## 1     0.00
## 2     0.01
## 3     0.02
## 4     0.03
## 5     0.04
## 6     0.05
## 7     0.06
## 8     0.07
## 9     0.08
## 10    0.09
## 11    0.10
Some model algorithms require several parameters for which an optimal combination is sought during model tuning. You can generate all combinations using expand.grid().
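For instance, a grid over two hypothetical tuning parameters (the parameter names below are just for illustration):

hyperparam_grid <- expand.grid(.lambda = c(0, 0.01, 0.1),
                               .fraction = seq(0.25, 1, by = 0.25))
nrow(hyperparam_grid)   # 3 * 4 = 12 candidate combinations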
The resampling strategy for the tuning process can be specified via trainControl. The main parameter is method which, among others, can be set to:
- 'boot': bootstrapping
- 'cv': k-fold cross-validation
- 'repeatedcv': repeated k-fold cross-validation

We'll use 10-fold cross-validation with 5 repeats:
train_ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 5)
Now the actual tuning process with repeated cross-validation can be started with train(). We pass the following arguments:
- boston_train_X (the predictors)
- boston_train_Y (the outcome)
- the model algorithm (method = 'ridge')
- the hyperparameter candidates to evaluate (tuneGrid = ridge_hyperparam_sets)
- the resampling strategy (trControl = train_ctrl)

Note that the input data should be centered and scaled (Hastie et al. 2017, p. 63), so we additionally set preProcess methods.
boston_train_X <- dplyr::select(boston_train, -medv)   # predictors
boston_train_Y <- boston_train$medv                    # outcome

tuning_ridge <- train(boston_train_X, boston_train_Y,
                      method = 'ridge',
                      tuneGrid = ridge_hyperparam_sets,
                      trControl = train_ctrl,
                      preProcess = c('center', 'scale'))
tuning_ridge
Ridge Regression
...
Pre-processing: centered (11), scaled (11)
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 345, 343, 343, 342, 344, 343, ...
Resampling results across tuning parameters:

  lambda  RMSE      Rsquared   MAE
  0.00    4.774223  0.7378075  3.480828
  0.01    4.769590  0.7381880  3.459474
  0.02    4.768482  0.7382246  3.442588
  0.03    4.769962  0.7380142  3.429599
  ...
  0.10    4.815535  0.7335818  3.407671

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was lambda = 0.02.
We see a table of performance measures for each evaluated lambda. The best model was chosen with lambda = 0.02, as it achieved the lowest RMSE.
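The tuning profile can also be inspected graphically via caret's plot method for train objects:

plot(tuning_ridge)   # resampled RMSE as a function of lambda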
The best model is off by about $4,780 (RMSE) or $3,440 (MAE). We can inspect the fit quickly by re-predicting the training data:
# use the same preprocessing as in training first
boston_train_X_scaled <- predict(tuning_ridge$preProcess, newdata = boston_train_X)

# we can access the selected model via `tuning_ridge$finalModel`.
ridge_pred <- predict(tuning_ridge$finalModel, newx = boston_train_X_scaled,
                      s = 1, type = 'fit', mode = 'fraction')$fit
qplot(boston_train_Y, ridge_pred, geom = 'point') +
  geom_abline(intercept = 0, slope = 1) +
  coord_fixed() +
  xlab('observed') + ylab('predicted') +
  theme_minimal()
We can see that there are some problems especially on the upper end of the response scale. Also, there seem to be some strange values of exactly 50.00 for the response which we should investigate further.
Ridge regression does not produce a "black box" model. We can have a look at the model's coefficients:
predict(tuning_ridge$finalModel, s = 1, type = 'coefficients', mode = 'fraction')$coefficients
##       crim      indus       chas        nox         rm        age
## -1.4718570 -1.1913948  0.7673612 -2.8105930  2.8128855 -0.2938233
##        rad    ptratio      black      lstat  dis_trans
##  1.2628486 -2.0301905  0.7796135 -3.8201206 -4.2611391
We can see that the log-transformed distance to employment centres (dis_trans) has a big negative effect on house values, as does the percentage of "lower status of the population" (lstat). The average number of rooms (rm) has a positive effect.
Besides investigating the problem with the large number of outcome values of exactly 50.00 (which probably is a data set issue), the next steps would include:
In this week's exercise you can complete a script with these steps.
See dedicated tasks sheet on the tutorial website.