R Tutorial at the WZB

1. Data pre-processing

Open the file “bostonhousing.R”, which is included in the archive “bostonhousing.zip” on the tutorial website, in RStudio. Run the script line by line and try to understand what each step does. Next, try out different preprocessing steps and see how the results change, e.g.:

change the cutoff value for removing correlated predictors
apply data transformations to predictors with skewed distributions
identify outliers and consider removing the observations that contain them
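
The three steps above can be explored with a few lines of base R. The following is only a sketch under assumptions: it uses the Boston data from the MASS package as a stand-in for the data loaded in bostonhousing.R, the cutoff of 0.8 is an arbitrary starting value, and skewness() is a small helper defined here, not a function from the tutorial script.

```r
library(MASS)  # ships with R; its Boston data is assumed to match the tutorial data

# Predictors only; "medv" is the outcome column in MASS::Boston
boston_X <- Boston[, setdiff(names(Boston), "medv")]

# 1. Correlated predictors: list all pairs above a cutoff (0.8 is an arbitrary choice)
cutoff <- 0.8
cor_mat <- abs(cor(boston_X))
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- 0   # keep each pair only once
high <- which(cor_mat > cutoff, arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(cor_mat)[high[, "row"]],
                         var2 = colnames(cor_mat)[high[, "col"]],
                         abs_cor = cor_mat[high])
print(high_pairs)

# 2. Skewed predictors: simple moment-based sample skewness (helper defined here)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skew <- sort(sapply(boston_X, skewness), decreasing = TRUE)
print(round(skew, 2))

# Example transformation: log of a strictly positive, right-skewed predictor
boston_X$crim_log <- log(boston_X$crim)

# 3. Outliers: count values beyond the boxplot whiskers (1.5 * IQR rule)
n_outliers <- sapply(boston_X, function(x) length(boxplot.stats(x)$out))
print(n_outliers)
```

Changing cutoff and rerunning the first part shows directly how many predictor pairs a stricter or looser threshold would flag.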

2. Tuning a Random Forest model

In the next tutorial session, we will learn about Random Forests (RF) for classification. However, this method can also be used for regression. Try to train an RF model in a similar way to the ridge regression model. You may need to install the package randomForest first.

This model has one important hyperparameter that you should tune: mtry. It specifies how many predictors are randomly sampled as candidates at each split when the decision trees are constructed (more on that in the next tutorial session). Create a set of hyperparameter candidates for this parameter, ranging from 2 to 9 (or more if you like, but no more than the number of predictors in the training data set). Then tune the model using train() with the method parameter set to "rf". Note that RFs do not need scaled or centered predictors, so leave out that preprocessing step here.

Note that when you run train() the computations will take much longer than with the ridge regression, because Random Forests are computationally intensive.
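
A minimal sketch of this tuning setup, assuming the caret and randomForest packages are installed. The data split, the 5-fold cross-validation, and the name boston_train_Y are assumptions made for illustration; the tutorial script may define its training data and resampling scheme differently.

```r
library(caret)
library(MASS)   # Boston data, used here as a stand-in for the tutorial data

set.seed(1234)

# Hypothetical 80/20 split; reuse the split from the tutorial script if it has one
train_idx <- createDataPartition(Boston$medv, p = 0.8, list = FALSE)
boston_train_X <- Boston[train_idx, setdiff(names(Boston), "medv")]
boston_train_Y <- Boston$medv[train_idx]

# Candidate values for mtry (must not exceed the number of predictors)
rf_grid <- expand.grid(mtry = 2:9)

# Resampling scheme: 5-fold cross-validation (an assumption)
ctrl <- trainControl(method = "cv", number = 5)

# No preProcess argument: the predictors are passed unscaled and uncentered
tuning_rf <- train(x = boston_train_X, y = boston_train_Y,
                   method = "rf",
                   tuneGrid = rf_grid,
                   trControl = ctrl)

print(tuning_rf$results)   # resampled RMSE, R squared and MAE per mtry value
print(tuning_rf$bestTune)  # mtry value with the lowest resampled RMSE
```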

Which value for mtry yields the best model? Are the performance measures of the best RF model better than those of the best ridge regression model? Try to replicate, for the RF model, the plots that were used to evaluate the tuning results and the best model fit for the ridge regression. Note that the call to predict() for predicting outcomes from a model and predictor data differs from the one used for the ridge regression. It is simply:

predict(<model>, <predictors>)
# e.g. predict(tuning_rf$finalModel, boston_train_X)
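
As a rough illustration of the evaluation plots, the sketch below tunes a deliberately small RF (three mtry values, 3-fold cross-validation, 100 trees) so that it runs quickly, then plots the tuning results and the observed versus predicted values. These settings, the use of the full Boston data from MASS as training data, and the name boston_train_Y are assumptions, not the tutorial's own setup.

```r
library(caret)
library(MASS)   # Boston data, used here as a stand-in for the tutorial data

set.seed(1234)
boston_train_X <- Boston[, setdiff(names(Boston), "medv")]
boston_train_Y <- Boston$medv

# Small, fast tuning run; ntree is passed on to randomForest()
tuning_rf <- train(x = boston_train_X, y = boston_train_Y,
                   method = "rf",
                   tuneGrid = expand.grid(mtry = c(2, 5, 8)),
                   trControl = trainControl(method = "cv", number = 3),
                   ntree = 100)

# Tuning results: resampled RMSE for each candidate mtry
print(plot(tuning_rf))

# Model fit: predictions of the final model for the training predictors.
# These are not out-of-bag predictions, so the fit looks overly optimistic.
pred <- predict(tuning_rf$finalModel, boston_train_X)
plot(boston_train_Y, pred,
     xlab = "observed medv", ylab = "predicted medv",
     main = "RF: observed vs. predicted (training data)")
abline(0, 1, lty = 2)   # points on this line are predicted perfectly

# RMSE, R squared and MAE of these predictions
print(postResample(pred = pred, obs = boston_train_Y))
```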

RF models are hard to interpret because they consist of hundreds of decision trees. However, it is possible to get an aggregate measure of “variable importance”. Run randomForest::varImpPlot(<finalmodel>) to find out which predictors contribute most to the predictions of your best RF model.
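
The following sketch shows the variable importance output on a directly fitted model, assuming the Boston data from MASS and arbitrary values for mtry and ntree. Note that randomForest() only computes the permutation-based measure (%IncMSE) when importance = TRUE is set; with the default, varImpPlot() shows only the node purity measure.

```r
library(randomForest)
library(MASS)   # Boston data, used here as a stand-in for the tutorial data

set.seed(1234)

# mtry = 4 and ntree = 200 are arbitrary values for illustration
rf_fit <- randomForest(medv ~ ., data = Boston,
                       mtry = 4, ntree = 200,
                       importance = TRUE)

# Dot charts of both importance measures
varImpPlot(rf_fit)

# The same information as a matrix, sorted by the permutation-based measure
imp <- importance(rf_fit)
print(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ])
```

When the model comes from train(), passing importance = TRUE in the train() call forwards it to randomForest(), so the final model then carries both measures as well.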

3. Final validation

Make a final validation of the best models (the best ridge regression and the best RF model) using the held-out test data. Does the RF model outperform ridge regression? If so, are there still reasons to choose the ridge regression model instead?
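
One possible shape of such a comparison is sketched below. It is not the tutorial's own procedure: the ridge model is fitted with cv.glmnet() from the glmnet package (alpha = 0), the RF uses fixed, untuned values for mtry and ntree, the 80/20 split is made up for the example, and rmse() and rsq() are helpers defined here. In your own solution, use the tuned models and the held-out test data from the script instead.

```r
library(MASS)          # Boston data, used here as a stand-in for the tutorial data
library(glmnet)        # ridge regression (assumed implementation)
library(randomForest)

set.seed(1234)

# Hypothetical 80/20 split into training and held-out test data
n <- nrow(Boston)
test_idx <- sample(n, size = round(0.2 * n))
boston_train <- Boston[-test_idx, ]
boston_test  <- Boston[test_idx, ]
predictors <- setdiff(names(Boston), "medv")

# Ridge regression: alpha = 0, lambda chosen by cross-validation
ridge_cv <- cv.glmnet(as.matrix(boston_train[, predictors]), boston_train$medv,
                      alpha = 0)
pred_ridge <- as.numeric(predict(ridge_cv,
                                 newx = as.matrix(boston_test[, predictors]),
                                 s = "lambda.min"))

# Random Forest with fixed (not tuned) hyperparameters
rf_fit <- randomForest(x = boston_train[, predictors], y = boston_train$medv,
                       mtry = 4, ntree = 200)
pred_rf <- predict(rf_fit, boston_test[, predictors])

# Helper functions for the performance measures
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rsq  <- function(obs, pred) cor(obs, pred)^2

results <- data.frame(
  model    = c("ridge", "RF"),
  RMSE     = c(rmse(boston_test$medv, pred_ridge), rmse(boston_test$medv, pred_rf)),
  Rsquared = c(rsq(boston_test$medv, pred_ridge),  rsq(boston_test$medv, pred_rf))
)
print(results)
```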


Tasks for 12 – Introduction to Machine Learning with R I

Markus Konrad

January 31, 2019
