Purpose: In this notebook we'll continue our exploration of ensembles by looking at boosting methods. While our previous topic, bagging and random forests, considered models in parallel, boosting methods use models in series. That is, boosting methods chain models together, passing information from previous models as inputs to subsequent models. In particular, boosting methods typically try to slowly chip away at the reducible error. In the first iteration of boosting, we build a weak learner (a high-bias model) to predict our response; in the next iteration, we build another weak learner whose predictions aim to reduce the errors (residuals) left by the first model. Subsequent boosting iterations build weak learners to reduce the prediction errors left over from previous rounds.
We'll use the regression setting to introduce boosting methods in this notebook, though the technique is applicable to classification as well. There are a few additional intricacies in the classification setting, but the main idea is the same. Let's see boosting in action using a small example with a single predictor. We'll start with a toy dataset.
We’ll plot the results of four rounds of boosting below.
We can see that each boosting iteration tries to (very slowly) reduce the total error made by the model.
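To make this concrete, here is a minimal sketch of boosting "by hand." The toy data (a noisy sine curve) and the use of shallow scikit-learn regression trees are illustrative assumptions rather than the exact dataset plotted above; the point is that each round fits a weak learner to the current residuals and adds a shrunken copy of its predictions to the ensemble.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy dataset (illustrative): one predictor, noisy sine response
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0, 10, size=100)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.3, size=100)

learn_rate = 0.3                           # how much of each weak learner we keep
ensemble_pred = np.full_like(y, y.mean())  # start from a constant prediction

for round_num in range(1, 5):              # four rounds of boosting
    residuals = y - ensemble_pred          # what the ensemble still gets wrong
    weak_learner = DecisionTreeRegressor(max_depth=1)  # a stump: deliberately weak
    weak_learner.fit(x, residuals)
    ensemble_pred += learn_rate * weak_learner.predict(x)
    print(f"Round {round_num}: MSE = {np.mean((y - ensemble_pred) ** 2):.3f}")

Because each round only corrects a fraction (learn_rate) of the remaining error, the mean squared error printed after each round shrinks gradually rather than all at once.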
We should beware of the following when using boosting methods.
{tidymodels}
A boosted model is a model class (that is, a model specification). We define our intention to build a boosting classifier using

boost_tree_spec <- boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("classification") # or "regression"
As with many of our model specifications, boosting models can be used for both regression and classification. For this reason, the line to set_mode() is required when declaring the model specification. The line to set_engine() above is unnecessary since xgboost is the default engine. There are other available engines, though.
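As an illustration of those two lines, a regression version of the same specification might look like the sketch below (the object name boost_tree_reg_spec is just a placeholder of ours):

library(tidymodels)

boost_tree_reg_spec <- boost_tree() %>%
  set_engine("xgboost") %>%   # optional: "xgboost" is already the default engine
  set_mode("regression")      # required: declares the modeling mode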
Like other model classes, boosted models have tunable hyperparameters. They are:

- mtry, which determines the number of randomly chosen predictors to offer each tree at each decision juncture.
- trees, which determines the number of trees in the ensemble.
- min_n, an integer determining the minimum number of training observations required for a node to be split further. That is, if a node/bucket contains fewer than min_n training observations, it will not be split further.
- tree_depth, an integer denoting the maximum depth of each individual tree (not available for all engines).
- learn_rate, which determines how quickly the model will attempt to learn by scaling how much each boosting iteration contributes to the model's ultimate predictions; smaller values chip away at the remaining error more slowly and typically require more trees. Values like 1e-5, 1e-3, 0.1, and 10 are typically a good starting point for the learning rate.

Additional hyperparameters are loss_reduction, sample_size, and stop_iter. You can see the full {parsnip} documentation for boost_tree(), including descriptions of those last three hyperparameters, here.
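For example, a minimal sketch of a specification that marks several of these hyperparameters for tuning might look like this (the object name and the particular set of tuned parameters are our own illustrative choices):

library(tidymodels)

boost_tree_tune_spec <- boost_tree(
  mtry = tune(),
  trees = tune(),
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

A grid of candidate values for these parameters would then be supplied at training time, for example via tune_grid() on a workflow.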
{sklearn}
A gradient boosting classifier is a model class. We first import GradientBoostingClassifier from sklearn.ensemble and then create an instance of the model constructor using:

from sklearn.ensemble import GradientBoostingClassifier
gb_clf = GradientBoostingClassifier()
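From here the usual scikit-learn workflow applies. The sketch below fits the classifier to a synthetic dataset from make_classification, which we use purely for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only to illustrate the fit/predict workflow
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

gb_clf = GradientBoostingClassifier()
gb_clf.fit(X_train, y_train)
print(gb_clf.score(X_test, y_test))   # mean accuracy on the held-out split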
Like other model classes, boosted models have tunable hyperparameters. The ones you are most likely to use are:

- max_features, which determines the number of randomly chosen predictors to offer each tree at each decision juncture.
- n_estimators, which determines the number of trees in the ensemble.
- min_samples_split, an integer (or float) determining the minimum number (or proportion) of training observations required for a node to be split further. That is, if a node/bucket contains fewer than min_samples_split training observations, it will not be split further.
- max_depth, an integer denoting the maximum depth of each individual tree.
- learning_rate, which determines how quickly the model will attempt to learn by scaling how much each boosting iteration contributes to the model's ultimate predictions; smaller values chip away at the remaining error more slowly and typically require more trees. Values like 1e-5, 1e-3, 0.1, and 10 are typically a good starting point for the learning rate.

You can see the full {sklearn} documentation for GradientBoostingClassifier(), including descriptions of these and the remaining hyperparameters, here.
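To tie these together, here is a hedged sketch that sets a few of these hyperparameters explicitly and cross-validates over the candidate learning rates mentioned above; the specific values are illustrative, not recommendations:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

gb_clf = GradientBoostingClassifier(
    n_estimators=200,       # number of trees in the ensemble
    max_depth=2,            # keep each individual tree weak (shallow)
    min_samples_split=10,   # don't split very small nodes
    max_features="sqrt"     # predictors offered at each split
)

# Cross-validate over the candidate learning rates from the text above
grid = GridSearchCV(gb_clf, param_grid={"learning_rate": [1e-5, 1e-3, 0.1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)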
In this notebook we were introduced to the notion of boosting methods. These are slow-learning techniques aimed at chipping away at the reducible error made by our models. We’ll implement boosting at our next class meeting.