Purpose: In this notebook we’ll introduce the notion of ensembles of models. In particular, we will introduce bootstrap aggregation (or bagging) and random forests. After reading through this notebook you should understand the big idea behind ensembles, how the bootstrap generates new training sets, how bagging aggregates many models’ predictions into one, and why random forests randomize the predictors available at each split.

The Big Idea

All this time we’ve been looking at individual model classes and searching for the best model, but we’ve perhaps presented ourselves with a false choice. Why not build a collection of models and then aggregate their individual predicted responses in order to obtain an ultimate prediction? This is the main idea behind ensembles.

Bootstrapping

Bootstrapping is a widely used technique to generate new, hypothetical random samples. We treat our available sample data as if it were the population, and repeatedly draw random samples from it. These random samples have the same size as the original set of sample data and are drawn with replacement.

For example, consider a small dataset with 10 observations, numbered 1 - 10. The data frame below contains the original observations and nine bootstrapped samples.

original  bs_1  bs_2  bs_3  bs_4  bs_5  bs_6  bs_7  bs_8  bs_9
       1     5     9    10     6     1    10     7     6     9
       2     5     5    10     9     6     6     6     8     1
       3     8     8     5     2     7     2    10     2     7
       4     1    10     3     9     1    10     3     2     2
       5     3     1     9     3     4     9     6     2     7
       6     5     9     8     3     1     3     7     4     1
       7     4     3     7     8     7     6     5     1     6
       8     4     7     2     3     8     7     4     9     2
       9    10     5     9     2     7     6     8     1     2
      10     5     2     6    10     7     7     5     1     9

Now we have 10 sets of training data we could use to fit models!
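As a small sketch of how samples like these can be drawn in R (the object names below are placeholders for illustration, and the seed is arbitrary):

# the original sample: observations numbered 1 through 10
original <- 1:10

# one bootstrap sample: the same size as the original, drawn with replacement
set.seed(42)
bs_1 <- sample(original, size = length(original), replace = TRUE)

# nine bootstrap samples at once, one per column of a matrix
bs_matrix <- replicate(9, sample(original, size = length(original), replace = TRUE))

In practice, helpers like bootstraps() from the {rsample} package can build these resamples for us.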

Bootstrap Aggregation (Bagging)

Once we have additional training sets obtained via the bootstrap, we can fit a model on each of them. We could do this with any class of model, but trees are the most commonly used. Given the example data from above, we could fit a model on each of the available training sets, giving us 10 models. Once we have those trained models, we can use them to make predictions. The act of distilling all of these predictions down to a single prediction is called aggregation. There are several techniques:

  • In classification applications, we can allow each model to vote on the class.
  • In classification applications, we can average each model’s predicted probability (its “certainty”) of membership in each class and choose the class with the highest average.
  • In regression applications, we can average the predictions together.

Note: We aren’t restricted to averaging – we can try any aggregation method we like.
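As a toy sketch of these aggregation schemes, with made-up predictions from three models:

# classification: let each model vote on the class of one new observation
votes <- c("spam", "not spam", "spam")
names(which.max(table(votes)))   # majority vote: "spam"

# classification: average each model's predicted probability of "spam"
probs <- c(0.80, 0.45, 0.95)
mean(probs)                      # averaged certainty: about 0.73

# regression: average the three predicted responses
preds <- c(10.2, 11.5, 9.8)
mean(preds)                      # bagged prediction: 10.5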

Random Forests

There’s a problem with bagging: all of the resulting models are likely to look similar. They have access to similar data and the same predictors. Unfortunately, this means that these models are likely to make the same mistakes – we often say that their prediction errors are correlated. We don’t really benefit by having lots of models if they all do the same thing and make similar errors. We need models that make different errors so that, on the whole, the errors will balance one another out. Is it possible that building several individually worse models will lead to a better ensemble overall? That’s the gist of the phenomenon known as the Wisdom of the Crowd.

A random forest is a form of bootstrap aggregation in which we construct a decision tree on each of the bootstrapped training sets. Rather than constructing an ordinary decision tree, though, we allow each tree access to only a random subset of the predictors each time it makes a split. Since each tree sees only a random subset of the predictors at any given split, the trees won’t all look alike. This means that the decision trees in our random forest ensemble won’t make the same errors, and the ensemble can benefit from the wisdom of the crowd.

Because a random forest is an ensemble of models, typically consisting of hundreds of trees, it has much lower interpretive value than an individual decision tree has. That being said, we can look at variable importance plots to determine which variables were selected most often by trees in the forest when the opportunity to take them arose. In this way, we are able to interpret which features are most strongly associated with the response variable.
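As a rough sketch of what this can look like, here is one way to fit a forest and draw a variable importance plot, using the {ranger} and {vip} packages with the built-in iris data purely for illustration (these particular packages and data are assumptions for the sketch, not something the rest of this notebook depends on):

library(ranger)
library(vip)

# fit a random forest, asking ranger to track impurity-based importance
rf_fit <- ranger(Species ~ ., data = iris, num.trees = 500, importance = "impurity")

# plot the predictors that contributed most across the forest
vip(rf_fit)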

Some Warnings

It will be useful to be aware of a few caveats regarding random forests.

How to Implement in {tidymodels}

A random forest is a model class, so in {tidymodels} we declare it with a model specification. We define our intention to build a random forest classifier using

library(tidymodels)

# declare a random forest specification for classification with the ranger engine
rf_clf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

As with many of our model specifications, random forests can be used for both regression and classification. For this reason, the call to set_mode() is required when declaring the model specification. The call to set_engine() above is strictly unnecessary since "ranger" is the default engine, but there are other available engines.
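As a quick sketch, a regression forest using a different engine might be declared like this (assuming the {randomForest} package is installed to back that engine):

# a regression specification using the randomForest engine instead of ranger
rf_reg_spec <- rand_forest() %>%
  set_engine("randomForest") %>%
  set_mode("regression")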

Hyperparameters and Other Extras

Like other model classes, random forests have tunable hyperparameters. They are:

  • mtry, which determines the number of randomly chosen predictors to offer each tree at each decision juncture.
  • trees determines the number of trees in the forest.
  • min_n is an integer determining the minimum number of training observations required for a node to be split further. That is, if a node/bucket contains fewer than min_n training observations, it will not be split further.

You can see the full {parsnip} documentation for rand_forest() here.
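As a quick sketch of where these hyperparameters live, we can mark them for tuning with tune() when declaring the specification (actually tuning them, with a grid and resamples, is left for a later class meeting):

# flag the hyperparameters for tuning rather than fixing their values
rf_tune_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")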


Summary

In this notebook you were introduced to the notions of the bootstrap, bootstrap aggregation (or bagging), and random forests. Aside from a few small sketches, none of these techniques were actually implemented here; we will implement them in our next class meeting.