- Classification / performance measures for classification
- Decision trees
- Random forests
- End of the tutorial
February 7, 2019
now online at https://wzbsocialsciencecenter.github.io/wzb_r_tutorial/
Classification is used in supervised machine learning, i.e. where we have training data with known outcomes.
The aim is to predict a categorical response from unseen data. This can be a two-class (i.e. binary) response or multi-class response.
Although the outcome categories are discrete, the predictions themselves are usually continuous class probabilities. Example: the predicted probability that an email belongs to the class "spam" is \(0.7\). Using a probability threshold of \(0.5\), this email would be classified as spam.
Classification models are also called classifiers.
A confusion matrix is a cross-tabulation of observed vs. predicted classes:
Confusion matrix for a two-class problem
→ everything on the diagonal is correctly classified, everything else is misclassified
The accuracy metric is the overall rate of correctly classified samples and can be calculated from a confusion matrix as:
\[ \text{Accuracy} = \frac{TP + TN}{N}, \]
where \(N = TP + TN + FP + FN\) (the number of observations).
Disadvantages:
Confusion matrix for a two-class problem
If we have high class imbalance, a trivial model can achieve almost perfect accuracy simply by classifying every response as negative:
# "A" occurs only 0.1% of the time: obs <- factor(rep(c('A', 'notA'), times = c(1, 999))) # a model that classifies everything as "notA": pred <- factor(rep('notA', times = 1000), levels = levels(obs)) (confmat <- table(pred, obs))
##       obs
## pred     A notA
##   A      0    0
##   notA   1  999
Calculating the (impressive) accuracy of our "model":
(confmat[1,1] + confmat[2,2]) / 1000
## [1] 0.999
Metrics that take class imbalance into account:
Confusion matrix for a two-class problem
Sensitivity / True positive rate: the rate at which condition A is predicted correctly among all samples that actually have condition A.
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]
The false negative rate is defined as \(1 - \text{Sensitivity}\).
Specificity / True negative rate: the rate at which condition ¬A is predicted correctly among all samples that actually have condition ¬A.
\[ \text{Specificity} = \frac{TN}{TN + FP} \]
The false positive rate is defined as \(1 - \text{Specificity}\).
Example: Should a spam classifier rather focus on high sensitivity or high specificity?
What do the sensitivity and specificity measures tell us practically?
Take for example the official test results of a face detection system that was tested by the German Federal Police at the Südkreuz station in Berlin:
With \(90,000\) passengers per day at Südkreuz, a false positive rate of \(0.1\%\) means \(90\) "false alarms" → \(90\) innocent people stopped per day.
The CCC later criticized the results as "whitewashed": the actual average FPR was \(0.67\%\).
Confusion matrix for a two-class problem
Balanced accuracy takes the mean of both measures:
\[ \text{Balanced acc.} = \frac{\text{Sensitivity} + \text{Specificity}}{2} \]
What would be the balanced accuracy of the "always-predict-negative model"?
##       obs
## pred     A notA
##   A      0    0
##   notA   1  999
(sens <- confmat[1,1] / (confmat[1,1] + confmat[2,1]))
## [1] 0
(spec <- confmat[2,2] / (confmat[1,2] + confmat[2,2]))
## [1] 1
(sens + spec) / 2
## [1] 0.5
Some history:
Idea:
→ trade-off: True positive rate (sensitivity) vs. false positive rate (\(1 - \text{Specificity}\))
ROC curves visualize that trade-off for a given model by evaluating sensitivity and specificity for a range of thresholds.
adapted from Kuhn & Johnson 2016
In the given example, a model with a threshold of \(0.3\) would achieve a higher sensitivity (\(0.6\) vs. \(0.4\)) than a model with a threshold of \(0.5\), at the cost of sacrificing specificity (\(0.786\) vs. \(0.929\)).
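A minimal sketch of this trade-off in R (the probabilities below are made up for illustration and are not the data behind the figure; lowering the threshold increases sensitivity at the cost of more false positives):

# illustrative predicted class probabilities and observed classes (made-up values)
probs <- c(0.9, 0.7, 0.6, 0.45, 0.35, 0.2)
obs <- factor(c('A', 'A', 'notA', 'A', 'notA', 'notA'))

# classify with two different probability thresholds and compare the confusion matrices
table(pred = factor(ifelse(probs >= 0.5, 'A', 'notA'), levels = levels(obs)), obs = obs)
table(pred = factor(ifelse(probs >= 0.3, 'A', 'notA'), levels = levels(obs)), obs = obs)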
ROC curves allow quantitative assessment of overall model performance.
A perfect model would have a true positive rate of 100% and false positive rate of 0%. The area under the ROC curve would be 1:
ROC curves allow quantitative assessment of overall model performance.
An ineffective model's ROC curve would follow the diagonal. The area under the ROC curve would be 0.5:
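This can be checked quickly with the pROC package (used again later in this tutorial); the random predictor below is an assumption made only for illustration:

library(pROC)

# an uninformative "model": predicted probabilities unrelated to the outcome
set.seed(1)
obs <- factor(rep(c('A', 'notA'), each = 100))
probs <- runif(200)

roc_rand <- roc(response = obs, predictor = probs)
auc(roc_rand)                            # close to 0.5 for a random predictor
plot.roc(roc_rand, legacy.axes = TRUE)   # curve roughly follows the diagonal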
There is a large number of classification models to choose from:
We will have a look at Decision Trees and Random Forest for classification and compare them with Logistic Regression.
I decided to use the PRODAT data set (former WZB project – Dieter Rucht / Simon Teune) as an example data set.
- polfeld1 and polfeld2 for primary and secondary policy fields

Question of interest for the following examples: How accurately can we predict if a protest (also) exhibits physical violence from the given characteristics?
- created a binary response variable y that captures if a protest also included violent actions (derived from variable dform4_4)
- created dummy variables from the categorical predictors, taking both policy fields into account (e.g. if polfeld1 is "social security" and polfeld2 is "environment", then both dummies polfeld.social_security and polfeld.environment would be set to 1)
- split the data (stratified by y) to create a training set with 75% of the data and a test set with the rest (see the sketch below)

After that, we have 8,887 observations with 240 variables (most of which are dummy variables) in the training set and 2,961 observations in the test set.
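A minimal sketch of such a stratified split with caret's createDataPartition() (the data frame name prodat and the seed are assumptions; the tutorial's actual preprocessing code is not shown here):

library(caret)

set.seed(123)   # assumed seed, only for reproducibility
# createDataPartition() samples within the levels of y, so the class
# proportions are (approximately) preserved in both sets
in_train <- createDataPartition(prodat$y, p = 0.75, list = FALSE)
train_data <- prodat[in_train, ]
test_data <- prodat[-in_train, ]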
Important predictors seen later:
- formleg: captures whether a protest was legal, illegal or of unknown status¹
- zahl: number of participants
- tallzahl: number of groups
- polfeld: policy field
- dauer: duration of protest
- tragsoz: social supporters² (e.g. workers, unemployed, women, etc.)
- nminder: captures whether a protest was pro or against ethnic minorities
- reapoli: police action yes/no
- objekt: immediate target of a protest
- adress: addressee of a protest (codes: 10 - addressee and immediate target are the same; 21 - pub. institutions; 27 - private persons; …; 30 - whole society)

¹ the codebook gives no further information on what "legal" or "illegal" means in the context of protests
² I'm not sure if this is the right term
Other characteristics of the data set:
- y has class imbalance: ~16% of protests are coded as including violence

¹ in order to uncover "informative missingness"; however, in practice this should probably not be done for all predictors
We compute a logistic regression model as a baseline model. This model uses a reduced set of predictors:

- only the primary policy field (i.e. polfeld1 and not polfeld2)
- no predictors that contain missing values (zahl, tallzahl)

We create the model as a simple additive function of all remaining 89 predictors:
# family = binomial for logistic regression
logreg_model <- glm(y ~ ., data = train_data, family = binomial)
Output of significant (at 5% sign. level) coefficients ordered by absolute magnitude:
   term                               log_odds_ratio odds_ratio  p.value
 1 formleg.illegal                              3.66      39.0    9.40e- 4
 2 tragsoz1.arbeitnehmer                       -3.15       0.0430 6.20e- 8
 3 reapoli.ja                                   1.74       5.71   1.06e- 3
 4 tragsoz1.ausl_nder_asyl_ethn_grupp          -1.67       0.187  3.63e-16
 5 adress.30                                   -1.44       0.236  2.55e-16
 6 tragsoz1.sonstige                           -1.38       0.251  5.87e-11
 7 nminder1.andere_proteste                    -1.20       0.300  2.18e-11
 8 tragsoz1.studenten                          -0.978      0.376  1.50e- 4
 9 objekt.privatpersonen                        0.931      2.54   3.48e- 7
10 tragsoz1.ka                                 -0.811      0.444  3.39e-12
...
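A sketch of how such a table could be produced with broom and dplyr (an assumption for illustration; the original code may differ):

library(dplyr)
library(broom)

tidy(logreg_model) %>%                                 # term, estimate, p.value, ...
  filter(p.value < 0.05, term != '(Intercept)') %>%    # keep significant coefficients
  transmute(term,
            log_odds_ratio = estimate,
            odds_ratio = exp(estimate),
            p.value) %>%
  arrange(desc(abs(log_odds_ratio)))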
The model has some extremely large coefficients, so it would be advisable to use a penalized model instead (like ridge, lasso), but for the moment we only care about the predictive performance:
ROC curve
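A sketch of how this ROC curve and its AUC can be computed with pROC (assuming test_data holds the held-out observations):

library(pROC)

# predicted probabilities for the positive class on the test set
logreg_prob <- predict(logreg_model, newdata = test_data, type = 'response')

logreg_roc <- roc(response = test_data$y, predictor = logreg_prob)
auc(logreg_roc)
plot.roc(logreg_roc, legacy.axes = TRUE)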
Decision trees or classification trees (CART algorithm – Breiman et al. 1984) recursively split the data into "purer" groups with regard to the response. This requires a measure of "group purity" in order to find the best predictor for splitting and its split value.
Example decision tree (source: Milborrow 2018)
The resulting decision tree is a binary tree: At each node there is a split into two sub-groups. The terminal nodes are also called leaves.
→ example data with the outcome is_spam and predictors that reflect how often the words "dollar" or "hello" occur in an email:
##   is_spam n_dollar n_hello
## 1    TRUE        5       1
## 2    TRUE        8       2
## 3    TRUE        6       0
## 4   FALSE        0       1
## 5   FALSE        2       1
How would you split the data?
→ choose n_dollar as the predictor for splitting, with a split value that divides the data into two groups, each consisting purely of "spam" or "not spam" samples:
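The same split can be reproduced from the toy data above (the split value 3 is one of several thresholds that separate the classes perfectly):

# the toy example data from above
spam_example <- data.frame(
  is_spam  = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  n_dollar = c(5, 8, 6, 0, 2),
  n_hello  = c(1, 2, 0, 1, 1)
)

# splitting on n_dollar >= 3 yields two pure sub-groups
table(spam_example$is_spam, spam_example$n_dollar >= 3)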
A decision tree algorithm evaluates each predictor and many possible split points (i.e. thresholds for splitting) by calculating a measure of purity for each prospective pair of sub-groups. One such measure is the Gini-index. The predictor-value combination that maximizes the Gini-index is used for splitting.
The splitting process is done recursively until a stopping criterion is met, e.g.:
Large decision trees tend to overfit the data: They split the training data into very small and very specific sub-groups and hence do not generalize well.
Usually a complexity parameter cp is used to penalize the purity criterion by a factor of the total number of terminal nodes in the tree (→ "pruning"). The larger a tree grows, the more penalty is put on the purity criterion.
An optimal value for cp can be found with hyperparameter tuning on the training data.
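A sketch of such a tuning run with caret (the candidate cp values are assumptions; the grid actually used in the tutorial is not shown):

library(caret)

cp_tuning <- train(y ~ ., data = train_data, method = 'rpart',
                   tuneGrid = data.frame(cp = c(0.001, 0.005, 0.01, 0.05)),
                   trControl = trainControl(method = 'repeatedcv', repeats = 3))
cp_tuning$bestTune   # cp value with the best cross-validated performance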
We build a small decision tree (high complexity penalty cp = 0.01) using the package rpart:
library(rpart)

treemodel <- rpart(y ~ ., data = train_data, control = rpart.control(cp = 0.01))
The resulting model can be visualized with rpart.plot, which allows for easy interpretation:
library(rpart.plot)

rpart.plot(treemodel)
Model interpretation can also be assisted by variable importance. We can measure how much each predictor contributes overall to increasing the purity criterion (i.e. increasing the Gini-index). → predictors that occur higher up / multiple times in the tree contribute more to the purity optimization
head(treemodel$variable.importance, 10)
formleg.illegal         965.40788
formleg.legal           525.13945
tragsoz.anonym          332.99901
nminder.contra          329.73977
adress.27                85.19441
objekt.privatpersonen    67.40235
aufpo.ja                 60.29222
dauer.NA                 25.28559
polfeld.ausw_rtiges      16.65478
dauer.1_bis_12_h         14.65957
Note: formleg has more than two levels.
Why are there predictors listed in the top 10 that did not appear in the tree plot?
Because each predictor's contribution is also measured for alternative trees (surrogate splits).
Evaluating the model using the held-out test_data:
library(caret)   # provides confusionMatrix()

# get class predictions (TRUE / FALSE)
pred_classes <- predict(treemodel, newdata = test_data_X, type = 'class')
confusionMatrix(pred_classes, test_data$y, positive = 'TRUE')
          Reference
Prediction FALSE TRUE
     FALSE  2441  199
     TRUE     33  288

               Accuracy : 0.9216
                  Kappa : 0.6697
[...]
            Sensitivity : 0.59138
            Specificity : 0.98666
[...]
      Balanced Accuracy : 0.78902
# get probability predictions
# (matrix with prob. for FALSE and TRUE response respectively)
pred_prob <- predict(treemodel, newdata = test_data_X, type = 'prob')

library(pROC)

roc_curve <- roc(response = test_data$y, predictor = pred_prob[,1])
auc(roc_curve)   # AUC ~ 0.90
plot.roc(roc_curve, legacy.axes = TRUE)
Cross-validation yielded an optimal cp = 0.005, which results in slightly better performance but a much more complex tree:
The AUC is about \(0.90\), which is worse than for the logistic regression:
ROC curves: Gray line is log. regr., black line is dec. tree model
Decision boundaries defined by a tree based model (source: Kuhn & Johnson 2016)
Advantages:
Disadvantages:
Decision trees exhibit several properties that make them weak learners (high variance, low predictive performance). Ensemble methods try to combine many (hundreds or thousands of) weak learners in order to form a "strong learner". Each weak learner's training process involves a random component to lower the variance and increase predictive performance.
Some ensemble methods do random resampling of the observations in the training data for each weak learner (e.g. bootstrap aggregation – bagging). Random Forest is an ensemble technique that uses both random resampling and random predictor selection for each decision tree.
Hyperparameters:
For each tree from 1 to \(n_{tree}\):
Resample training data with replacement (bootstrap sample)
Train a tree model until a stopping criterion is reached; at each split:
Randomly select \(m_{try}\) predictors out of the original predictors
Select the best predictor-value combination for splitting
(i.e. that maximizes Gini-index)
In the end, we have a Random Forest ensemble which consists of \(n_{tree}\) decision trees.
For predicting the outcome of a new sample:
Example:
       negative neutral positive
votes       687     201      112
prob.      0.69    0.20     0.11
→ final class for new sample would be "negative" with probability \(0.69\)
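The aggregation itself is simple; a toy illustration using the vote counts from above:

# votes of the individual trees for one new sample
votes <- c(negative = 687, neutral = 201, positive = 112)

votes / sum(votes)        # class probabilities: ~0.69, 0.20, 0.11
names(which.max(votes))   # majority vote -> "negative"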
In contrast to decision trees, Random Forest (RF) can't handle missings in the predictors.¹ Hence we remove zahl and tallzahl from the predictors and then build the model using the randomForest package:
library(randomForest)

# hyperparameters ntree = 500 and mtry = 50 were chosen from
# prior experience here; I will later show a proper tuning
rfmodel <- randomForest(x = train_data_X, y = train_data$y,
                        ntree = 500, mtry = 50)
The model can now be used for prediction:
# to predict classes:
predict(rfmodel, test_data_X, type = 'response')

# to predict class probabilities:
predict(rfmodel, test_data_X, type = 'prob')
¹ One reason for this is that reasonable surrogate splits may not always exist in the RF context, since predictors are chosen randomly for each split (there may not be a good, correlated surrogate predictor to split on).
Can we see how the outcome for a new sample is actually predicted?
The model is an ensemble of 500 decision trees, each consisting of hundreds or thousands of nodes. For a prediction, each tree generates one class vote. To retrace the decision for a single sample, we would have to investigate how each of the 500 votes was generated.
As an example, we can have a look at the 10th tree that was created (this tree has more than 1000 nodes):
getTree(rfmodel, k = 10, labelVar = TRUE)
  left daughter right daughter             split var split point status prediction
1             2              3              tverb.ka         0.5      1       <NA>
2             4              5            reapoli.ja         0.5      1       <NA>
3             6              7 objekt.privatpersonen         0.5      1       <NA>
4             8              9        nfrauen.contra         0.5      1       <NA>
5            10             11   nandere.mit_familie         0.5      1       <NA>
...
(couldn't find a way to quickly visualize this)
So, a Random Forest model is essentially a "black box model"; its decisions are almost impossible to retrace. However, it is possible to derive the variable importance using varImp() or varImpPlot():
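For example (varImpPlot() comes with the randomForest package; the number of displayed predictors is an arbitrary choice):

# plot the most important predictors of the Random Forest model
varImpPlot(rfmodel, n.var = 15, main = 'Variable importance')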
There is currently a lot of research on "explainable ML models" (e.g. Ribeiro et al. 2016).
RF is computationally expensive: a single RF model took ~10 minutes to compute on my laptop
→ parameter tuning should be done with parallel processing on a cluster machine
library(caret)   # provides train() and trainControl()
library(doMC)

registerDoMC(32)   # use 32 out of 64 CPU cores on cluster machine

rf_tuning <- train(x = train_data_X, y = train_data$y, method = 'rf',
                   tuneGrid = data.frame(.mtry = c(2, 25, 50, 75, 100, 125, 150, 200)),
                   trControl = trainControl(method = 'repeatedcv', repeats = 3))
Achieved metrics on test data:
          Accuracy : 0.9395
[...]
             Kappa : 0.7635
[...]
       Sensitivity : 0.7290
       Specificity : 0.9810
[...]
 Balanced Accuracy : 0.8550
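A sketch of how these test-set metrics can be obtained from the tuned model (assuming the final model stored in rf_tuning is used; confusionMatrix() is from caret):

# class predictions of the tuned RF model on the held-out test set
rf_pred <- predict(rf_tuning, newdata = test_data_X)
confusionMatrix(rf_pred, test_data$y, positive = 'TRUE')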
ROC curves: Light gray line is log. regr., dark gray line is dec. tree, black line is RF model
AUC: 0.972 → better than the logistic regression, which had an AUC of \(0.93\), sensitivity of \(0.70\) and specificity of \(0.96\)
Advantages:
Disadvantages:
¹ Having several strongly collinear predictors has an impact on the reliability of the variable importance measure.
Reflect on what we just did: we built a classifier for predicting violence in protests.
S. Milborrow 2018: Plotting rpart trees with the rpart.plot package
Bellovin 2019: Yes, “algorithms” can be biased. Here’s why
Exercise:
Stay up-to-date:
Network:
Read: