#knitr::opts_chunk$set(eval = FALSE)
library(tidyverse)
library(tidymodels)
library(rpart.plot)
library(reticulate)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from plotnine import ggplot, aes, geom_point, labs, scale_color_manual, theme
from sklearn.tree import DecisionTreeClassifier, plot_tree
Purpose: In this notebook we’ll introduce decision tree models. These are another class of model which can be used in both the regression and classification settings. In particular, we note the following.
Decision tree models begin with all observations belonging to a single “group”. Within this single group/bucket, all observations would have the same predicted response. The fitting algorithm for decision trees then asks whether we could improve our predictions by splitting this bucket into two smaller buckets of observations, each getting its own prediction. The fitting algorithm continues in this manner until predictions are no longer improved or some stopping criterion is met.
Let’s see this in action by building a decision tree classifier on some toy data with four classes.
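The data-generating code does not appear in this rendering. A minimal sketch of how comparable toy data could be simulated is below; the frame name toy_data, the column names X1, X2, and label, and the cluster centers are all assumptions for illustration.

# Simulate a small four-class dataset with two numeric predictors (hypothetical stand-in for the notebook's data)
np.random.seed(123)
centers = [(-2, -2), (-2, 2), (2, -2), (2, 2)]
frames = []
for class_label, (cx, cy) in enumerate(centers):
    frames.append(pd.DataFrame({
        "X1": np.random.normal(cx, 0.8, 100),
        "X2": np.random.normal(cy, 0.8, 100),
        "label": str(class_label)
    }))
toy_data = pd.concat(frames, ignore_index=True)

# Plot the simulated classes
(
    ggplot(toy_data, aes(x = "X1", y = "X2", color = "label"))
    + geom_point()
    + labs(title = "Toy data with four classes")
)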
Now that we have our data, let’s build a decision tree classifier on it.
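A sketch of the fit, assuming the simulated toy_data frame above; the last line returns the fitted estimator, which is what is echoed below.

# Fit a decision tree classifier with default settings to the two predictors
dt_clf = DecisionTreeClassifier()
dt_clf.fit(toy_data[["X1", "X2"]], toy_data["label"])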
DecisionTreeClassifier()
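The plotting code is also not shown; one way to visualize the fitted classifier is to predict the class over a fine grid of (X1, X2) values, as sketched below.

# Predict the class over a grid of (X1, X2) values to show the decision regions
x1_grid, x2_grid = np.meshgrid(
    np.linspace(toy_data["X1"].min(), toy_data["X1"].max(), 200),
    np.linspace(toy_data["X2"].min(), toy_data["X2"].max(), 200)
)
grid = pd.DataFrame({"X1": x1_grid.ravel(), "X2": x2_grid.ravel()})
grid["predicted"] = dt_clf.predict(grid[["X1", "X2"]])

(
    ggplot(grid, aes(x = "X1", y = "X2", color = "predicted"))
    + geom_point(size = 0.1)
    + labs(title = "Predicted class across the feature space")
)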
In the plot above, we see that the decision tree classifier seems to do quite well! The tree is asking yes/no questions about the individual predictors (X1 or X2), which we can see because the decision boundaries are line segments perpendicular to those axes. In the plot below, we can see the actual structure of the decision tree.
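The tree-diagram code is not shown in this rendering; a sketch using plot_tree() (imported above) might look like the following, assuming the dt_clf fit from the earlier sketch.

# Draw the structure of the fitted tree; each node shows its splitting rule
fig, ax = plt.subplots(figsize = (12, 6))
plot_tree(dt_clf, feature_names = ["X1", "X2"],
          class_names = sorted(toy_data["label"].unique()),
          filled = True, ax = ax)
plt.show()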
Trees won’t always perform well, however. Indeed, if the optimal structure of the decision boundaries is not constructible via line segments perpendicular to the feature axes, we may end up requiring a very deep tree to approximate the decision boundary. A different model class is likely to be a better choice in these cases.
Consider a second toy dataset with two classes, plotted below.
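The code that builds this second dataset is not shown; as a hypothetical stand-in, the sketch below simulates two classes separated by a diagonal boundary, which a tree can only approximate with axis-perpendicular splits. The frame name second_data and its column names are assumptions.

# Simulate two classes separated by a diagonal boundary (X2 > X1)
np.random.seed(456)
second_data = pd.DataFrame({
    "X1": np.random.uniform(-3, 3, 400),
    "X2": np.random.uniform(-3, 3, 400)
})
second_data["label"] = np.where(second_data["X2"] > second_data["X1"], "A", "B")

# Plot the two classes
(
    ggplot(second_data, aes(x = "X1", y = "X2", color = "label"))
    + geom_point()
    + labs(title = "Toy data with a diagonal class boundary")
)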
Now let’s try fitting a decision tree model to this data, as we did in the earlier example.
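A sketch of the fit with default settings, again assuming the simulated second_data frame above.

# Fit a default decision tree classifier to the second toy dataset
dt_clf2 = DecisionTreeClassifier()
dt_clf2.fit(second_data[["X1", "X2"]], second_data["label"])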
DecisionTreeClassifier()
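And the same grid-prediction visualization as before, an assumed plotting approach rather than the notebook's original code.

# Predict over a grid to show the decision regions learned for the second dataset
x1_grid, x2_grid = np.meshgrid(
    np.linspace(second_data["X1"].min(), second_data["X1"].max(), 200),
    np.linspace(second_data["X2"].min(), second_data["X2"].max(), 200)
)
grid2 = pd.DataFrame({"X1": x1_grid.ravel(), "X2": x2_grid.ravel()})
grid2["predicted"] = dt_clf2.predict(grid2[["X1", "X2"]])

(
    ggplot(grid2, aes(x = "X1", y = "X2", color = "predicted"))
    + geom_point(size = 0.1)
    + labs(title = "Predicted class across the feature space (second dataset)")
)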
In the plot above, we see that the decision tree classifier is performing poorly, even though the classification problem should be quite easy! This is because a decision tree's decision boundaries must be built from segments perpendicular to the feature axes, which makes a boundary like this one difficult to capture.
Knowing a bit about the structure of our data, what a likely decision boundary may look like, and which scenarios our model classes are best-suited for can be really helpful in making our modeling endeavors more efficient!
It is useful to be aware of the following regarding decision trees.

- Decision tree models are prone to overfitting by the nature of their fitting process.
- The deeper a tree, or the more end-nodes it has, the more flexible the model is.
- We need to use regularization techniques to constrain our trees and prevent this overfitting.
- The {tidymodels} ecosystem has been built on a pit of success (rather than pit of failure) philosophy. The idea is that it should be easy to do the right thing, and difficult to do the wrong thing. For this reason, decision trees utilize some regularization by default to prevent overfitting.
- The {sklearn} ecosystem was not built with the same pit of success philosophy. The default trees in this module are unconstrained and will overfit.
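As a rough, hypothetical illustration of this last point, the sketch below adds some label noise to the assumed second_data frame from earlier and then compares an unconstrained default tree with a lightly regularized one; the noise rate and the hyperparameter values are arbitrary choices for demonstration.

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Flip about 10% of the labels so that memorizing the training data is harmful
noisy = second_data.copy()
flip = np.random.default_rng(789).random(len(noisy)) < 0.1
noisy.loc[flip, "label"] = np.where(noisy.loc[flip, "label"] == "A", "B", "A")

train, test = train_test_split(noisy, random_state = 42)

# An unconstrained (default) tree versus one regularized by depth and cost-complexity pruning
default_tree = DecisionTreeClassifier().fit(train[["X1", "X2"]], train["label"])
pruned_tree = DecisionTreeClassifier(max_depth = 4, ccp_alpha = 0.01).fit(train[["X1", "X2"]], train["label"])

for name, model in [("default", default_tree), ("regularized", pruned_tree)]:
    print(name,
          "train accuracy:", accuracy_score(train["label"], model.predict(train[["X1", "X2"]])),
          "test accuracy:", accuracy_score(test["label"], model.predict(test[["X1", "X2"]])))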
{tidymodels}

A decision tree is a model class (that is, a model specification). We define our intention to build a decision tree classifier using:
dt_clf_spec <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
Decision trees can be used for both regression and classification. For this reason, the line to set_mode() is required when declaring the model specification. The line to set_engine() above is unnecessary since rpart is the default engine; there are other available engines, though.
Like other model classes, decision trees have tunable hyperparameters. They are:

- cost_complexity, which is a penalty associated with growing the tree (including additional splits).
- tree_depth, an integer denoting the depth of the tree. This is the maximum number of splits between the root node and any leaf of the tree.
- min_n, an integer determining the minimum number of training observations required for a node to be split further. That is, if a node/bucket contains fewer than min_n training observations, it will not be split further.

You can see the full {parsnip} documentation for decision_tree() here.
{sklearn}
A decision tree classifier is a model class. We first import DecisionTreeClassifier from sklearn.tree and then create an instance of the model constructor using:
from sklearn.tree import DecisionTreeClassifier, plot_tree
dt_clf = DecisionTreeClassifier()
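A brief usage sketch, assuming the toy_data frame from the earlier example: like other sklearn estimators, we fit with .fit() and predict with .predict().

# Fit the classifier on the earlier toy data and predict for a couple of new observations
dt_clf.fit(toy_data[["X1", "X2"]], toy_data["label"])
new_points = pd.DataFrame({"X1": [0.5, -1.0], "X2": [2.0, -2.5]})
dt_clf.predict(new_points)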
Like other model classes, decision trees have tunable hyperparameters. You are most likely to use:

- ccp_alpha, which is a penalty associated with large trees. A grown tree will be pruned back to fall below this threshold.
- max_depth, an integer denoting the depth of the tree. This is the maximum number of splits between the root node and any leaf of the tree.
- min_samples_split, an integer (or float) determining the minimum number (or proportion) of training observations required for a node to be split further. That is, if a node/bucket contains fewer than min_samples_split training observations, it will not be split further.
- criterion, which determines how the quality of a split is measured. Options are gini, entropy, and log_loss, with gini as the default.

There are additional hyperparameters as well. You can see the full {sklearn} documentation for DecisionTreeClassifier() here.
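For example, a constrained tree could be specified as below; the particular values are illustrative, not recommendations.

# A decision tree constrained by depth, minimum node size, and cost-complexity pruning
dt_clf_constrained = DecisionTreeClassifier(
    max_depth = 5,
    min_samples_split = 20,
    ccp_alpha = 0.005,
    criterion = "gini"
)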
In this notebook you were introduced to decision tree models. This is a simple class of model which is highly interpretable and is easily explained to non-experts. These models mimic our own “If this, then that” decision-making style.