Purpose: In this notebook we’ll introduce the notion of ensembles of models. In particular, we’ll cover bootstrap aggregation (or bagging) and random forests. After reading through this notebook you should understand the following:
- An ensemble is a collection of models used together to generate a predicted response.
- An ensemble can implement models in parallel or in series.
- Bootstrapping is a technique to generate new, hypothetical random samples of observations.
- A random forest is a collection of decision trees, each of which is presented a random subset of features at each decision juncture.
All this time we’ve been looking at individual model classes and searching for the best model, but we’ve perhaps presented ourselves with a false choice. Why not build a collection of models and then aggregate their individual predicted responses in order to obtain an ultimate prediction? This is the main idea behind ensembles.
Bootstrapping is a widely used technique to generate new, hypothetical random samples. We treat our available sample data as if it were the population, and repeatedly draw random samples from it. These random samples have the same size as the original set of sample data and are drawn with replacement.
For example, consider a small dataset with 10 observations, numbered 1 - 10. The data frame below contains the original observations and nine bootstrapped samples.
original | bs_1 | bs_2 | bs_3 | bs_4 | bs_5 | bs_6 | bs_7 | bs_8 | bs_9 |
---|---|---|---|---|---|---|---|---|---|
1 | 8 | 8 | 4 | 9 | 9 | 6 | 3 | 8 | 6 |
2 | 9 | 10 | 9 | 10 | 9 | 10 | 2 | 10 | 6 |
3 | 4 | 2 | 9 | 2 | 3 | 4 | 3 | 10 | 5 |
4 | 9 | 4 | 3 | 5 | 6 | 6 | 8 | 3 | 1 |
5 | 3 | 5 | 10 | 3 | 6 | 3 | 7 | 10 | 1 |
6 | 9 | 2 | 3 | 2 | 5 | 9 | 2 | 2 | 3 |
7 | 9 | 5 | 6 | 3 | 1 | 1 | 1 | 5 | 4 |
8 | 7 | 5 | 2 | 2 | 9 | 4 | 5 | 7 | 2 |
9 | 8 | 3 | 3 | 9 | 2 | 2 | 4 | 5 | 6 |
10 | 3 | 10 | 8 | 9 | 10 | 2 | 5 | 2 | 2 |
Now we have 10 sets of training data we could use to fit models!
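To make this concrete, here is a minimal sketch of how such bootstrap samples might be drawn. It assumes `numpy`, which is not otherwise used in this notebook; each sample is the same size as the original data and is drawn with replacement.

```python
import numpy as np

rng = np.random.default_rng(1)
original = np.arange(1, 11)  # observations numbered 1 through 10

# nine bootstrap samples: same size as the original, drawn with replacement
bootstrap_samples = [
    rng.choice(original, size=original.size, replace=True) for _ in range(9)
]
```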
Once we have additional training sets obtained via the bootstrap, we can fit a model on each of them. We could do this with any class of model, but trees are the most commonly used. Given the example data from above, we could fit a model on each of the available training sets, giving us 10 models. Once we have those trained models, we can use each of them to make predictions. The act of distilling all of these predictions down to a single prediction is called aggregation. There are several techniques: for a numeric response we might average the individual predictions, while for a categorical response we might take a majority vote.
Note: We aren’t restricted to averaging – we can try any aggregation method we like.
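As a rough sketch of bagging done by hand, the code below fits one decision tree per bootstrap sample and aggregates the predictions by majority vote. The toy dataset, the choice of `DecisionTreeClassifier`, and the vote threshold are illustrative assumptions rather than steps taken in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# toy data standing in for a real training set
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

rng = np.random.default_rng(42)
trees = []
for b in range(25):
    # draw a bootstrap sample: same size as the data, with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx]))

# aggregate the 25 sets of predictions by majority vote (labels here are 0/1)
votes = np.stack([tree.predict(X) for tree in trees])
bagged_predictions = (votes.mean(axis=0) >= 0.5).astype(int)
```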
There’s a problem with bagging: all of the resulting models are likely to look similar. They have access to similar data and the same predictors. Unfortunately, this means these models are likely to make the same mistakes – we often say that their prediction errors are correlated. We don’t benefit much from having lots of models if they all do the same thing and make similar errors. We need models that make different errors so that, on the whole, those errors balance one another out. Is it possible that building several individually worse models will lead to a better ensemble overall? That’s the gist of the phenomenon known as the Wisdom of the Crowd.
A random forest is a form of bootstrap aggregation in which we construct a decision tree on each of the bootstrapped training sets. Rather than constructing an ordinary decision tree, though, we allow each tree access to only a random subset of the predictors each time it makes a split. Since each tree sees a different random subset of predictors at each split, the trees won’t all look alike. This means the decision trees in our random forest ensemble won’t all make the same errors, and the ensemble can benefit from the wisdom of the crowd.
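To make the “random subset of predictors at each split” idea concrete, note that in scikit-learn even a single decision tree can be restricted, via `max_features`, to consider only a random subset of the predictors each time it searches for a split. The sketch below uses a toy dataset for illustration; a random forest combines this per-split randomness with bagging.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# at every split, this tree considers only 3 randomly chosen predictors out of 10
tree = DecisionTreeClassifier(max_features=3, random_state=0).fit(X, y)
```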
Because a random forest is an ensemble of models, typically consisting of hundreds of trees, it has much lower interpretive value than an individual decision tree has. That being said, we can look at variable importance plots to determine which variables were selected most often by trees in the forest when the opportunity to take them arose. In this way, we are able to interpret which features are most strongly associated with the response variable.
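For instance, a fitted scikit-learn forest exposes one (impurity-based) importance score per feature through its `feature_importances_` attribute. The toy dataset and feature names below are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# larger scores mean a feature contributed more to impurity reduction across splits
for name, score in zip([f"x{i}" for i in range(X.shape[1])], forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```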
It will be useful to be aware of the following regarding random forests: the size of each tree can be limited directly (by setting a parameter like `max_depth`) or indirectly (by tuning a parameter like `min_n`).

{tidymodels}
A random forest is a model class (that is, a model specification). We define our intention to build a random forest classifier using:

```r
rf_clf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")
```
As with many of our model specifications, random forests can be used for both regression and classification. For this reason, the call to `set_mode()` is required when declaring the model specification. The call to `set_engine()` above is unnecessary, since `ranger` is the default engine, but there are other available engines.
Like other model classes, random forests have tunable hyperparameters:

- `mtry` determines the number of randomly chosen predictors offered to each tree at each decision juncture.
- `trees` determines the number of trees in the forest.
- `min_n` is an integer determining the minimum number of training observations required for a node to be split further. That is, if a node/bucket contains fewer than `min_n` training observations, it will not be split further.

You can see the full {parsnip} documentation for `rand_forest()` here.
{sklearn}
A random forest is a model class. We first import `RandomForestClassifier` from `sklearn.ensemble` and then create an instance of the model constructor using:

```python
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()
```
Like other model classes, random forests have tunable hyperparameters:

- `max_features` determines the number of randomly chosen predictors offered to each tree at each decision juncture.
- `n_estimators` determines the number of trees in the forest.
- `min_samples_split` is an integer (or float) determining the minimum number (or proportion) of training observations required for a node to be split further. That is, if a node/bucket contains fewer than `min_samples_split` training observations, it will not be split further.

You can see the full {sklearn} documentation for `RandomForestClassifier()` here.
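For example, a hypothetical instantiation that sets all three hyperparameters (the particular values are arbitrary choices for illustration, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(
    n_estimators=500,      # number of trees in the forest
    max_features="sqrt",   # predictors offered to each tree at each split
    min_samples_split=10,  # nodes with fewer than 10 observations are not split
)
```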
In this notebook you were introduced to the notions of the bootstrap, bootstrap aggregation (or bagging), and random forests. None of these techniques were actually implemented in this notebook; we will implement them in our next class meeting.