library(reticulate)
import pandas as pd
import numpy as np
from plotnine import ggplot, geom_point, geom_line, geom_abline, aes, labs
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve
Purpose: In this notebook, our goal is to differentiate between regression and classification. In particular, we’ll build several classifiers on toy data, introduce the confusion matrix, and assess our models using accuracy, recall, precision, and the area under the ROC curve.
As we saw in MAT300, it is sometimes useful to have some “fake” data to play with in order to understand classification models. We’ll often call these “fake” datasets toy data. I’ve built several toy data sets below. I encourage you to play around by switching data sets and by making changes to the data sets I’ve provided.
#Easy two-class
num_a = 25
num_b = 30
np.random.seed(123)
x_a = np.random.normal(loc = 20, scale = 5, size = num_a)
y_a = np.random.normal(loc = 20, scale = 5, size = num_a)
x_b = np.random.normal(loc = 50, scale = 5, size = num_b)
y_b = np.random.normal(loc = 50, scale = 5, size = num_b)
toy_data = pd.DataFrame({
"x" : np.concatenate([x_a, x_b]),
"y" : np.concatenate([y_a, y_b]),
"label" : np.concatenate([num_a*["A"], num_b*["B"]])
})
#Second separable two-class with rare class
# num_points = 1000
# x = np.random.uniform(size = num_points, low = 0, high = 50)
# y = np.random.uniform(size = num_points, low = 0, high = 50)
# label = num_points*["A"]
# for i in range(num_points):
#     if np.sqrt((x[i] - 20)**2 + (y[i] - 40)**2) < 3:
#         label[i] = "A"
#     else:
#         label[i] = "B"
# toy_data = pd.DataFrame({
# "x" : x,
# "y" : y,
# "label" : label
# })
# #Separable three-class
# num_a = 10
# num_b = 15
# num_c = 12
#
# x_a = np.random.normal(size = num_a, loc = 20, scale = 5)
# y_a = np.random.normal(size = num_a, loc = 20, scale = 3)
# x_b = np.random.normal(size = num_b, loc = 40, scale = 3)
# y_b = np.random.normal(size = num_b, loc = 40, scale = 3)
# x_c = np.random.normal(size = num_c, loc = 60, scale = 3)
# y_c = np.random.normal(size = num_c, loc = 20, scale = 3)
#
#toy_data = pd.DataFrame({
# "x" : np.concatenate([x_a, x_b, x_c]),
# "y" : np.concatenate([y_a, y_b, y_c]),
# "label" : np.concatenate([num_a*["A"], num_b*["B"], num_c*["C"]])
# })
# ##Two class rolled
# num_a = 250
# num_b = 175
#
# x_a = np.random.uniform(size = num_a, low = 0, high = 40)
# x_b = np.random.uniform(size = num_b, low = 25, high = 60)
# y_a = -(x_a - 25)**2 + 475 + np.random.normal(size = num_a, loc = 0, scale = 50)
# y_b = (x_b - 40)**2 + np.random.normal(size = num_b, loc = 0, scale = 50)
#
#toy_data = pd.DataFrame({
# "x" : np.concatenate([x_a, x_b]),
# "y" : np.concatenate([y_a, y_b]),
# "label" : np.concatenate([num_a*["A"], num_b*["B"]])
# })
(
ggplot(data = toy_data) +
geom_point(aes(x = "x", y = "y", color = "label"))
)
## <Figure Size: (640 x 480)>
Now that we have some toy data, we’ll build several classifiers.
dt_clf = DecisionTreeClassifier()
dt_clf.fit(toy_data[["x", "y"]], toy_data["label"])
## DecisionTreeClassifier()
new_x = pd.DataFrame({
"x" : np.linspace(min(toy_data["x"]) - 5, max(toy_data["x"]) + 5, num = 150)
})
new_y = pd.DataFrame({
"y" : np.linspace(min(toy_data["y"]) - 5, max(toy_data["y"]) + 5, num = 150)
})
new_grid = new_x.merge(new_y, how = "cross")
new_grid["pred_class"] = dt_clf.predict(new_grid[["x", "y"]])
(
ggplot() +
geom_point(aes(x = new_grid["x"], y = new_grid["y"], color = new_grid["pred_class"]), alpha = 0.02) +
geom_point(aes(x = toy_data["x"], y = toy_data["y"], color = toy_data["label"])) +
labs(title = "Class Regions for Decision Tree")
)
## <Figure Size: (640 x 480)>
Now a KNN Classifier…
knn_clf = KNeighborsClassifier()
knn_clf.fit(toy_data[["x", "y"]], toy_data["label"])
## KNeighborsClassifier()
new_grid["pred_class"] = knn_clf.predict(new_grid[["x", "y"]])
(
ggplot() +
geom_point(aes(x = new_grid["x"], y = new_grid["y"], color = new_grid["pred_class"]), alpha = 0.01) +
geom_point(aes(x = toy_data["x"], y = toy_data["y"], color = toy_data["label"])) +
labs(title = "Class Regions for KNN (n = 3)", color = "label")
)
## <Figure Size: (640 x 480)>
And a logistic regression classifier… (Note: Standard logistic regression is a binary classifier, so it is not appropriate for cases with more than two classes unless it is extended, for example via a multinomial or one-vs-rest approach.)
log_reg_clf = LogisticRegression()
log_reg_clf.fit(toy_data[["x", "y"]], toy_data["label"])
## LogisticRegression()
new_grid["pred_class"] = log_reg_clf.predict(new_grid[["x", "y"]])
(
ggplot() +
geom_point(aes(x = new_grid["x"], y = new_grid["y"], color = new_grid["pred_class"]), alpha = 0.02) +
geom_point(aes(x = toy_data["x"], y = toy_data["y"], color = toy_data["label"])) +
labs(title = "Logistic Regression", color = "label")
)
## <Figure Size: (640 x 480)>
Note that different classifiers make different assumptions about the data. Some classifiers make probabilistic or distributional assumptions, while others are distance-based. In general, different classifiers produce different decision-boundary structures and can make quite different predictions on new observations. This means that we are searching for the most appropriate model class and level of flexibility as we approach classification problems.
There are several performance measures for classifiers. Perhaps the most obvious measure is accuracy – the proportion of observations correctly classified by our model. We’ll encounter others as well – accuracy is not always (or perhaps even often) the most appropriate performance measure for our classifiers.
Before getting into computing accuracy and our other performance metrics, it is worth talking about a confusion matrix. Confusion matrices summarize our data and predictions – the true classes of the observations are along the rows and the predicted classes are along the columns. For example, a confusion matrix for our decision tree classifier from earlier appears below.
toy_data["pred_class_dt"] = dt_clf.predict(toy_data[["x", "y"]])
confusion_matrix(toy_data["label"], toy_data["pred_class_dt"])
## array([[25, 0],
## [ 0, 30]], dtype=int64)
The confusion matrix is an array of numbers whose rows correspond to the true class that an observation belongs to. The columns of the confusion matrix correspond to the observation’s predicted class according to the model. This means that the entries along the main diagonal count correctly classified observations, while the off-diagonal entries count each type of misclassification.
Understanding the confusion matrix can give us great insight as to where our model is becoming “confused” and how we might hope to improve it.
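If it helps to keep the orientation straight, here’s a quick sketch that wraps the confusion matrix in a labeled DataFrame so the rows read as true classes and the columns as predicted classes (the class_labels, cm, and cm_labeled names are just for illustration).
class_labels = ["A", "B"]
cm = confusion_matrix(toy_data["label"], toy_data["pred_class_dt"], labels = class_labels)
#rows are true classes, columns are predicted classes
cm_labeled = pd.DataFrame(cm, index = ["true " + lbl for lbl in class_labels], columns = ["pred " + lbl for lbl in class_labels])
cm_labeled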
The accuracy of a classification model is the rate at which it correctly classifies observations. To compute accuracy, we sum the entries of the confusion matrix along the main diagonal and then divide by the total number of observations we made predictions for. Mathematically, we have
\[\text{accuracy} = \frac{\sum_{i}{a_{ii}}}{n}\]
where \(a_{ii}\) is the \(i^{th}\) diagonal entry of the confusion matrix and \(n\) is the total number of observations that predictions were made on.
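As a quick sanity check on that formula, here’s a minimal sketch that computes accuracy by hand from the decision tree’s confusion matrix; np.trace sums the diagonal entries, and cm and acc_by_hand are just illustrative names.
cm = confusion_matrix(toy_data["label"], toy_data["pred_class_dt"])
#sum of the diagonal (correct predictions) divided by the total number of predictions
acc_by_hand = np.trace(cm) / cm.sum()
acc_by_hand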
The {sklearn} module, specifically sklearn.metrics, makes it easy for us to compute accuracy and nearly any other performance metric we would want to utilize!
accuracy_score(toy_data["label"], toy_data["pred_class_dt"])
## 1.0
To see why accuracy might not be the best measure of model performance, comment out the first data set in our initial code cell and uncomment the second toy data set – the one labeled second separable two-class with rare class. Run the code to generate the data and then rerun the code cells up to this point to reconstruct the decision tree, KNN, and logistic regression classifiers.
Pay special attention to the plot of the logistic regression classifier. We’ll print out the confusion matrix and the accuracy metric corresponding to that model here.
toy_data["pred_class_lr"] = log_reg_clf.predict(toy_data[["x", "y"]])
confusion_matrix(toy_data["label"], toy_data["pred_class_lr"])
## array([[25, 0],
## [ 0, 30]], dtype=int64)
accuracy_score(toy_data["label"], toy_data["pred_class_lr"])
## 1.0
What happened? That level of accuracy is pretty excellent, right? Unfortunately not, if we care about the rare class more than the dominant class. This is often the case, especially in medical applications where a disease is relatively rare and most cases are truly negative. In such settings we are more interested in the ability to identify those rare cases than in simply obtaining an accurate model.
Especially in situations where we are dealing with rare classes, we need other performance metrics to capture what we truly care about.
Recall measures our model’s ability to detect the presence of a class of interest from the population. That is, given that an observation belongs to class X, what is the likelihood that the model predicted class X?
To compute the recall rate for Class X, we only look at the row of the confusion matrix corresponding to Class X. The denominator is the total number of observations belonging to that row. The numerator is the diagonal entry in that row (i.e., the correctly classified observations).
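Before reaching for the built-in function, here’s a small sketch of that calculation using the logistic regression predictions from above; class A occupies row 0 since confusion_matrix orders classes alphabetically, and recall_A_by_hand is just an illustrative name.
cm = confusion_matrix(toy_data["label"], toy_data["pred_class_lr"])
#recall for class "A": correct "A" predictions divided by all observations truly in class "A"
recall_A_by_hand = cm[0, 0] / cm[0, :].sum()
recall_A_by_hand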
Again, {sklearn} makes computing our performance metric quite easy.
recall_score(toy_data["label"], toy_data["pred_class_lr"], pos_label = "A")
## 1.0
Like accuracy, the higher the recall rate, the better the model’s ability to detect observations of the class of interest.
Precision measures how trustworthy our model’s predictions of a class of interest are. That is, given that the model predicted class X, what is the likelihood that the observation actually belongs to class X?
To compute the precision rate for Class X, we only look at the column of the confusion matrix corresponding to Class X. The denominator is the total number of observations belonging to that column. The numerator is the diagonal entry in that column (i.e., the correctly classified observations).
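As with recall, here’s a small sketch of the by-hand calculation for class A, this time summing down the corresponding column; precision_A_by_hand is just an illustrative name.
cm = confusion_matrix(toy_data["label"], toy_data["pred_class_lr"])
#precision for class "A": correct "A" predictions divided by all observations predicted to be class "A"
precision_A_by_hand = cm[0, 0] / cm[:, 0].sum()
precision_A_by_hand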
Again, {sklearn} makes computing our performance metric quite easy.
precision_score(toy_data["label"], toy_data["pred_class_lr"], pos_label = "A")
## 1.0
Like accuracy and recall, the higher the precision rate, the more confident we can be that observations the model assigns to the class of interest truly belong to that class.
When dealing with binary classification, the area under the receiver operating characteristic (ROC) curve summarizes the trade-off between the true positive rate and the false positive rate as a decision threshold varies. The idea is that we could use a variety of thresholds for assigning an observation to the class of interest.
The natural threshold is 50%: if our model predicts a 50% chance or higher, we assign that observation to the class of interest, and otherwise we assign it to the other class. This isn’t the only possible choice, however. We could use a threshold of 20%, which would allow us to capture more of the cases actually belonging to the class of interest (true positives) but would also result in more mistakes (false positives). Similarly, we could use a threshold of 80%, which would result in fewer false positives but also fewer true positives. One of the goals here is to find the optimal threshold.
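To make the idea concrete, here’s a minimal sketch that converts the logistic regression’s predicted probabilities for class B into labels under two different thresholds; the probs_B, preds_20, and preds_80 names are just for illustration.
#column 1 of predict_proba() holds the probability of class "B" (classes are ordered alphabetically)
probs_B = log_reg_clf.predict_proba(toy_data[["x", "y"]])[:, 1]
#a permissive threshold captures more true positives but admits more false positives
preds_20 = np.where(probs_B >= 0.20, "B", "A")
#a conservative threshold admits fewer false positives but misses more true positives
preds_80 = np.where(probs_B >= 0.80, "B", "A")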
For ROC-AUC, the closer to 1, the better our model is doing at discriminating true positives from false positives. Values close to 0.5 are like random guessing and indicate poor model performance. Values below 0.5 indicate that something has gone wrong, because our model is doing worse than random guessing would – in such cases it is likely that we have mixed up our classes somewhere.
lr_class_probs = log_reg_clf.predict_proba(toy_data[["x", "y"]])
roc_auc_score(toy_data["label"], lr_class_probs[:, 1])
## 1.0
fpr, tpr, thresh = roc_curve(toy_data["label"], lr_class_probs[:, 1], pos_label = "B")
(
ggplot() +
geom_line(aes(x = fpr, y = tpr), color = "blue") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(x = "1 - Specificity (FPR)", y = "Sensitivity (TPR)", title = "ROC Curve (Logistic Regressor)")
)
## <Figure Size: (640 x 480)>
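One common way to pick a threshold from the curve above is Youden’s J statistic – the threshold at which the gap between the true positive rate and the false positive rate is largest. Here’s a minimal sketch using the fpr, tpr, and thresh arrays returned by roc_curve; j_scores and best_thresh are just illustrative names.
#Youden's J statistic at each candidate threshold
j_scores = tpr - fpr
#threshold where the curve sits farthest above the diagonal
best_thresh = thresh[np.argmax(j_scores)]
best_thresh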
While we’ll most often stick with accuracy, precision, recall, and roc_auc, it is worth knowing that there are other assessment metrics for classifiers. You can find several of them in the sklearn.metrics documentation.
In this notebook we saw several examples of classification data sets, built three classifiers, and assessed them using different performance metrics. In particular, we saw classification accuracy, the rate of recall for a class of interest, the rate of precision for a class of interest, and the area under the ROC curve as performance metrics. We discussed how they are computed and showed how to calculate them easily with functionality from the sklearn.metrics submodule.
Being able to assess model performance is critical for our ability to move forward. These assessment metrics are how we will compare our models and how we will determine whether they are performing well or poorly and whether they are improving.
At this point, we have all of the foundational material we need. We’ll move on to implementing specific classes of classification model and discussing what applications and use cases they are well-suited or poorly-suited to.