Assessing Classifiers

Purpose: In this notebook, our goals are to differentiate between regression and classification. In particular, we’ll recognize the following:

Regression models are built to predict or explain a numerical response.
Classification models are build to predict or explain a categorical response.
Regression models seek to fit the training data as closely as possible – your mental image might be as a “line of best fit”.
Classification models seek to separate classes – your mental image might be that classifiers seek to “draw fences”.

Toy Data for Classification

As we saw in MAT300, it is sometimes useful to have some “fake” data to play with in order to understand classification models. We’ll often call these “fake” datasets toy data. I’ve built several toy data sets below. I encourage you to play around by switching data sets and by making changes to the data sets I’ve provided.

###
#Easy two-class
num_a <- 25
num_b <- 30

set.seed(123)
x_a <- rnorm(num_a, 20, 5)
y_a <- rnorm(num_a, 20, 5)
x_b <- rnorm(num_b, 50, 5)
y_b <- rnorm(num_b, 50, 5)

toy_data <- tibble(
  x = c(x_a, x_b), y = c(y_a, y_b),
  class = c(rep("A", num_a), rep("B", num_b))) %>%
  mutate(class = factor(class))

###
#  #Second separable two-class with rare class
#  num_points <- 1000
#  x <- runif(num_points, 0, 50)
#  y <- runif(num_points, 0, 50)
#  class <- ifelse(sqrt((x - 20)^2 + (y - 40)^2) < 3,
#                  "A", "B")
# toy_data <- tibble(x = x, y = y, class = class) %>%
#   mutate(class = factor(class))

### 
# #Non-Separable three-class
# num_a <- 25
# num_b <- 40
# num_c <- 30
# 
# x_a <- rnorm(num_a, 35, 5)
# y_a <- rnorm(num_a, 25, 3)
# x_b <- rnorm(num_b, 40, 3)
# y_b <- rnorm(num_b, 35, 3)
# x_c <- rnorm(num_c, 45, 3)
# y_c <- rnorm(num_c, 25, 3)
# 
# toy_data <- tibble(x = c(x_a, x_b, x_c), y = c(y_a, y_b, y_c), class = c(rep("A", num_a), rep("B", num_b), rep("C", num_c))) %>%
#   mutate(class = factor(class))

###
# ##Two class rolled
# num_a <- 250
# num_b <- 175
# 
# x_a <- runif(num_a, 0, 40)
# x_b <- runif(num_b, 25, 60)
# y_a <- -(x_a - 25)^2 + 475 + rnorm(num_a, mean = 0, sd = 50)
# y_b <- (x_b - 40)^2 + rnorm(num_b, mean = 0, sd = 50)
# 
# toy_data <- tibble(x = c(x_a, x_b), y = c(y_a, y_b), class = c(rep("A", num_a), rep("B", num_b))) %>%
#   mutate(class = factor(class))

#####################################
##More Interesting Data Sets Below!##
#####################################

###
# #Swiss Roll
# num_a <- 250
# num_b <- 175
# 
# # Generate t values (controls the roll)
# t_a <- runif(num_a, 1.5 * pi, 4.5 * pi)  # First class (inner roll)
# t_b <- runif(num_b, 3.5 * pi, 6.5 * pi)  # Second class (outer roll)
# 
# # Radius (with some randomness)
# r_a <- t_a + rnorm(num_a, mean = 0, sd = 0.3)
# r_b <- t_b + rnorm(num_b, mean = 0, sd = 0.3)
# 
# # Convert to Cartesian coordinates (Swiss Roll)
# x_a <- r_a * cos(t_a)
# y_a <- r_a * sin(t_a)
# 
# x_b <- r_b * cos(t_b)
# y_b <- r_b * sin(t_b)
# 
# # Combine into a tibble
# toy_data <- tibble(
#   x = c(x_a, x_b),
#   y = c(y_a, y_b),
#   class = factor(c(rep("A", num_a), rep("B", num_b)))
# )

###
# # Five Class Swiss Roll
# num_points <- 150  # Per class
# 
# # Generate t values for five distinct class bands
# t1 <- runif(num_points, 1.5 * pi, 3 * pi)  
# t2 <- runif(num_points, 2.5 * pi, 4 * pi)
# t3 <- runif(num_points, 3.5 * pi, 5 * pi)
# t4 <- runif(num_points, 4.5 * pi, 6 * pi)
# t5 <- runif(num_points, 5.5 * pi, 7 * pi)
# 
# # Radius with some noise
# r1 <- t1 + rnorm(num_points, mean = 0, sd = 0.3)
# r2 <- t2 + rnorm(num_points, mean = 0, sd = 0.3)
# r3 <- t3 + rnorm(num_points, mean = 0, sd = 0.3)
# r4 <- t4 + rnorm(num_points, mean = 0, sd = 0.3)
# r5 <- t5 + rnorm(num_points, mean = 0, sd = 0.3)
# 
# # Convert to Cartesian coordinates
# x1 <- r1 * cos(t1); y1 <- r1 * sin(t1)
# x2 <- r2 * cos(t2); y2 <- r2 * sin(t2)
# x3 <- r3 * cos(t3); y3 <- r3 * sin(t3)
# x4 <- r4 * cos(t4); y4 <- r4 * sin(t4)
# x5 <- r5 * cos(t5); y5 <- r5 * sin(t5)
# 
# # Combine into a tibble
# toy_data <- tibble(
#   x = c(x1, x2, x3, x4, x5),
#   y = c(y1, y2, y3, y4, y5),
#   class = factor(rep(1:5, each = num_points))
# )

###
# # Five Class Concentric Ellipses
# num_points <- 150  
# 
# # Function to generate elliptical clusters
# generate_ellipse <- function(a, b, theta_shift, num_points, noise = 0.2) {
#   theta <- runif(num_points, 0, 2 * pi)  # Angle around the ellipse
#   x <- a * cos(theta + theta_shift) + rnorm(num_points, sd = noise)
#   y <- b * sin(theta + theta_shift) + rnorm(num_points, sd = noise)
#   tibble(x = 10*x, y = 10*y)
# }
# 
# # Define parameters for each class (major/minor axes and rotation)
# ellipse_params <- list(
#   list(a = 1,  b = 0.5,  theta_shift = 0),
#   list(a = 2,  b = 1,    theta_shift = pi/8),
#   list(a = 3,  b = 1.5,  theta_shift = pi/4),
#   list(a = 4,  b = 2,    theta_shift = pi/6),
#   list(a = 5,  b = 2.5,  theta_shift = pi/3)
# )
# 
# # Generate data for each class
# toy_data <- lapply(1:5, function(i) {
#   df <- generate_ellipse(
#     a = ellipse_params[[i]]$a,
#     b = ellipse_params[[i]]$b,
#     theta_shift = ellipse_params[[i]]$theta_shift,
#     num_points = num_points
#   )
#   df$class <- factor(i)
#   df
# }) %>% bind_rows()

###
# #Another five-class problem
# library(MASS)  # For multivariate normal sampling
# 
# # Number of points per class
# num_points <- 150  
# 
# # Define means for the five overlapping Gaussians
# means <- matrix(c(
#   0, 0, 
#   2, 2, 
#   -2, 2, 
#   -2, -2, 
#   2, -2
# ), ncol = 2, byrow = TRUE)
# 
# # Define a common covariance matrix to ensure overlap
# cov_matrix <- matrix(c(1.5, 0.8, 0.8, 1.5), ncol = 2)
# 
# # Generate Gaussian clusters
# generate_gaussian <- function(mean, num_points, cov_matrix) {
#   MASS::mvrnorm(n = num_points, mu = mean, Sigma = cov_matrix) %>%
#     as_tibble(.name_repair = "minimal")
# }
# 
# # Create dataset
# toy_data <- lapply(1:5, function(i) {
#   df <- generate_gaussian(means[i, ], num_points, cov_matrix)
#   df$class <- factor(i)
#   df
# }) %>% bind_rows() %>%
#   rename(x = ...1, y = ...2)

###
# #One More fun one!
# num_points <- 150  
# 
# # Function to generate spiral arms
# generate_spiral <- function(num_points, turns, noise, class_label) {
#   t <- seq(0, turns * pi, length.out = num_points) + rnorm(num_points, 0, noise)
#   x <- t * cos(t)
#   y <- t * sin(t)
#   tibble(x = x, y = y, class = factor(class_label))
# }
# 
# # Generate five intertwined spirals
# toy_data <- bind_rows(
#   generate_spiral(num_points, 3, 0.3, 1),
#   generate_spiral(num_points, 3, 0.3, 2) %>% mutate(x = -x, y = -y),
#   generate_spiral(num_points, 3, 0.3, 3) %>% mutate(x = y, y = x),
#   generate_spiral(num_points, 3, 0.3, 4) %>% mutate(x = -y, y = -x),
#   generate_spiral(num_points, 3, 0.3, 5) %>% mutate(x = y, y = -x)
# )

toy_data %>%
  ggplot() + 
  geom_point(aes(x = x, y = y, 
                 color = class,
                 shape = class))

Now that we have some toy data, we’ll build several classifiers.

dt_clf <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

dt_rec <- recipe(class ~ x + y, data = toy_data)

dt_wf <- workflow() %>%
  add_model(dt_clf) %>%
  add_recipe(dt_rec)

dt_fit <- dt_wf %>%
  fit(toy_data)

new_data <- crossing(x = seq(min(toy_data$x) - 5, 
                             max(toy_data$x) + 5, 
                             length.out = 150),
                     y = seq(min(toy_data$y) - 5, 
                             max(toy_data$y) + 5, 
                             length.out = 150))

new_data <- dt_fit %>%
  augment(new_data)

ggplot() + 
  geom_point(data = new_data,
             aes(x = x, y = y, color = .pred_class),
             alpha = 0.05) +
  geom_point(data = toy_data,
             aes(x = x, y = y, color = class)) +
  labs(title = "Class Regions for Decision Tree",
       color = "Class")

Now a KNN Classifier…

knn_clf <- nearest_neighbor(neighbors = 3) %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_rec <- recipe(class ~ x + y, data = toy_data)

knn_wf <- workflow() %>%
  add_model(knn_clf) %>%
  add_recipe(knn_rec)

knn_fit <- knn_wf %>%
  fit(toy_data)

new_data <- crossing(x = seq(min(toy_data$x) -5,
                             max(toy_data$x) + 5,
                             length.out = 150),
                     y = seq(min(toy_data$y) - 5,
                             max(toy_data$y) + 5,
                             length.out = 150))

new_data <- knn_fit %>%
  augment(new_data)

ggplot() + 
  geom_point(data = new_data,
             aes(x = x, y = y, color = .pred_class),
             alpha = 0.05) +
  geom_point(data = toy_data,
             aes(x = x, y = y, 
                 color = class,
                 shape = class)) +
  labs(title = "Class Regions for KNN (n = 3)",
       color = "Class")

And a logistic regression classifier… (Note: This one is not appropriate for cases with more than two classes.)

log_reg_clf <- logistic_reg() %>%
  set_mode("classification")

log_reg_rec <- recipe(class ~ x + y, data = toy_data)

log_reg_wf <- workflow() %>%
  add_model(log_reg_clf) %>%
  add_recipe(log_reg_rec)

log_reg_fit <- log_reg_wf %>%
  fit(toy_data)

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

new_data <- crossing(x = seq(min(toy_data$x) -5,
                             max(toy_data$x) + 5,
                             length.out = 150),
                     y = seq(min(toy_data$y) - 5,
                             max(toy_data$y) + 5,
                             length.out = 150))

new_data <- log_reg_fit %>%
  augment(new_data)

ggplot() + 
  geom_point(data = new_data,
             aes(x = x, y = y, color = .pred_class),
             alpha = 0.05) +
  geom_point(data = toy_data,
             aes(x = x, y = y, 
                 color = class,
                 shape = class)) +
  labs(title = "Logistic Regression",
       color = "Class")

Note that different classifiers make different assumptions about the data. Some classifiers make probablistic or distribution assumptions, while others are distance-based. In general, different classifiers will result in different boundary structures and can make quite different predictions on new observations. This means that we are searching for the most appropriate model class and level of flexibility as we approach classification problems.

Assessing Classifier Performance

There are several performance measures for classifiers. Perhaps the most obvious measure is accuracy – the proportion of observations correctly classified by our model. We’ll encounter others as well – accuracy is not always (or perhaps even often) the most appropriate performance measure for our classifiers.

Before getting into computing accuracy and our other performance metrics, it is worth talking about a confusion matrix. Confusion matrices summarize our data and predictions – the true classes of the observations are along the columns and the predicted classes are along the rows. For example, a confusion matrix for our decision tree classifier from earlier appears below.

dt_fit %>%
  augment(toy_data) %>%
  conf_mat(class, .pred_class)

          Truth
Prediction  A  B
         A 25  0
         B  0 30

Reading the Confusion Matrix

The confusion matrix is an array of numbers whose columns correspond to the true class that an observation belongs to. The rows of the confusion matrix correspond to the observation’s predicted class according to the model. This means that

observations along the main diagonal of the confusion matrix are correctly predicted observations.
observations away from the main diagonal correspond to incorrectly predicted observations – in particular, the number in row \(i\), column \(j\) is the number of observations which are truly class \(j\) but were predicted as class \(i\).

Understanding the confusion matrix can give us great insight as to where our model is becoming “confused” and how we might hope to improve it.

Computing Accuracy

The accuracy of a classification model is the rate at which it correctly classifies observations. To compute accuracy, we sum the entries of the confusion matrix along the main diagonal and then divide by the total number of observations we made predictions for. Mathematically, we have

\[\text{accuracy} = \frac{\sum{a_{ii}}}{n}\]

where \(a_{ii}\) is the \(i^{th}\) diagonal entry of the confusion matrix and \(n\) is the total number of observations that predictions were made on.

The {tidymodels} collection of packages, specifically {yardstick}, makes it easy for us to compute accuracy and nearly any other performance metric we would want to utilize!

dt_fit %>%
  augment(toy_data) %>%
  accuracy(class, .pred_class)

# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary             1

Why Not Accuracy?

To see why accuracy might not be the best measure of model performance, comment out the first data set in our initial code cell and uncomment the second toy data set – this is the one labeled second separable two-class with rare case. Run the code to generate the data and then rerun the code cells up to this point in order to construct the decision tree, KNN, and logistic regression classifiers.

Pay special attention to the plot of logistic regression classifier. We’ll print out the confusion matrix and the accuracy metric corresponding to that model here.

log_reg_fit %>%
  augment(toy_data) %>%
  conf_mat(class, .pred_class)

          Truth
Prediction  A  B
         A 25  0
         B  0 30

log_reg_fit %>%
  augment(toy_data) %>%
  accuracy(class, .pred_class)

# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary             1

What happened? That level of accuracy is pretty excellent, right? Unfortunately not if we care about that rare class more than the dominant class. This is often the case, especially in medical applications where the prevalence of a relatively rare disease is not widespread and most cases are truly negative. We are perhaps more interested in an ability to identify those rare cases than we are in simply obtaining an accurate model.

Especially in situations where we are dealing with rare classes, we need other performance metrics to capture what we truly care about.

Recall

Recall measures our model’s ability to detect the presence of a class of interest from the population. That is, given that an observation belongs to class X, what is the likelihood that the model predicted class X?

To compute the recall rate for Class X, we only look at the column of the confusion matrix corresponding to Class X. The denominator is the total number of observations belonging to that column. The numerator is the diagonal entry in that column (ie. the correctly classified observations).

Again, {tidymodels}, makes computing our performance metric quite easy.

log_reg_fit %>%
  augment(toy_data) %>%
  recall(class, .pred_class, event_level = "second")

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 recall  binary             1

Like accuracy, the higher the recall rate, the better the model’s ability to detect observations of the class of interest.

Precision

Precision measures the confidence we should have when our model predicts than an observation belongs to a class of interest. That is, given that the model predicted class X, what is the likelihood that the observation actually belongs to class X?

To compute the precision rate for Class X, we only look at the row of the confusion matrix corresponding to Class X. The denominator is the total number of observations belonging to that row. The numerator is the diagonal entry in that row (ie. the correctly classified observations).

Again, {tidymodels}, makes computing our performance metric quite easy.

log_reg_fit %>%
  augment(toy_data) %>%
  precision(class, .pred_class, event_level = "second")

# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary             1

Like accuracy and recall, the higher the precision rate, the better the model’s ability to detect observations not belonging to the class of interest.

ROC-AUC

When dealing with binary classification, the receiver-operator-curve’s area under the curve is a measure of the increases in true positive and false positive rates when varying a decision threshold. The idea is that we could use a variety of thresholds for assigning an observation to the class of interest.

The natural threshold is that if our model predicts a 50% chance or higher, then we should assign that observation to the class of interest and otherwise assign it to the other class. This isn’t the only possible choice, however. We could use a threshold of 20%, which would allow us to capture more of the cases actually belonging to the class of interest (true positives) but would also result in more mistakes (false positives). Similarly, we could use a threshold of 80% which would result in fewer false positives but also fewer true positives. One of the goals here is to find the optimal threshold.

For ROC-AUC, the close to 1, the better our model is doing at discriminating true positives from false positives (we get nearly all of the true positives before obtaining any false positives). Values closer to 0.5 are like random guessing and indicate poor model performance. Values below 0.5 indicate that something has gone wrong because our model is doing worse than random guessing would do – in such cases it is likely that we have mixed up our classes somewhere.

log_reg_fit %>%
  augment(toy_data) %>%
  roc_curve(class, .pred_A) %>%
  autoplot()

log_reg_fit %>%
  augment(toy_data) %>%
  roc_auc(class, .pred_A)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary             1

Note that some {tidymodels} objects have an autoplot() method which will quickly construct a ggplot() displaying results in a reasonable manner. We can always construct a custom ggplot() on our own.

Other Assessment Metrics for Classifiers

While we’ll most often stick with accuracy, precision, recall, and roc_auc, it is worth knowing that there are other assessment metrics for classifiers. You can find several of them here on the {yardstick} vignette.

MultiClass Log-Loss (for your Competition)

Many of our models can be adapted to output probability estimates for each possible class rather than just a single class prediction. Some performance metrics take into account not just whether a model’s predictions are correct, but also how confident the model is in those predictions. One such metric, used in our semester-long Kaggle competition, is multiclass log loss. This performance metric is calculated as follows:

\[\text{MultiClass Log Loss} = -\frac{1}{n}\sum_{\stackrel{i = 1}{\text{predictions}}}^{n}{\sum_{\stackrel{j = 1}{\text{classes}}}^{k}{y_{ij}\log\left(\hat{y_{ij}}\right)}}\]

Where \(y_{ij} = 1\) if the true class of observation \(i\) is class \(j\), but \(y_{ij} = 0\) otherwise, and \(\hat{y_{ij}}\) is the model’s prediction that observation \(i\) belongs to class \(j\).

For each observation, only the probability assigned to the true class affects the log loss, since the summation ensures that terms corresponding to incorrect classes are multiplied by 0 and do not contribute. If the model is highly confident and correct, the predicted probability \(\hat{y_{ij}}\) will be close to 1, making \(\log\left(\hat{y_{ij}}\right)\) close to 0 resulting in a small loss contribution for that observation. However, if the model is uncertain or incorrect, \(\hat{y_{ij}}\) will be smaller, leading to a larger (more negativ) logarithm and higher loss contribution for that observation. This makes multiclass log loss particularly useful for evaluating models that output probability distributions, as it rewards well-calibrated confidence in predictions.

We can calculate mutliclass log loss using {yardstick} with the mn_log_loss() scoring function.

Summary

In this notebook we saw several examples of classification datasets, built three classifiers, and assessed them using different performance metrics. In particular, we saw classification accuracy, the rate of recall for a class of interest, the rate of precision for a class of interest, and the area under the roc-curve as performance metrics. We discussed how they are computed and showed how to calculate them easily with functionality from the {yardstick} package in {tidymodels}.

Being able to assess model performance is critical for our ability to move forward. These assessment metrics are how we will compare our models and how we will be able to determine whether our models are performing well/poorly and also improving or not.

At this point, we have all of the foundational material we need. We’ll move on to implementing specific classes of classification model and discussing what applications and use cases they are well-suited or poorly-suited to.