Utilizing Categorical Predictors

Published

August 3, 2024

library(tidyverse)
library(tidymodels)
library(palmerpenguins)
library(patchwork)
library(kableExtra)
tidymodels_prefer()

options(kable_styling_bootstrap_options = c("hover", "striped"))

theme_set(theme_bw(base_size = 14))

penguins <- palmerpenguins::penguins

set.seed(123)
penguins_split <- initial_split(penguins)
penguins_train <- training(penguins_split)
penguins_test <- testing(penguins_split)

Objectives

In our recent notebooks, we’ve attempted to predict the body mass of penguins. In doing so, we’ve utilized only numerical features as predictors. In this notebook, we’ll extend our ability to build models by adding categorical predictors to our repertoire. After working through this notebook, you should be able to:

  • Interpret a series of dummy variable columns as they relate to a categorical variable in a data set.
  • Identify the base level of a categorical variable by looking at its corresponding dummy variable columns.
  • Use step_dummy() in conjunction with a recipe() to create dummy variables from a categorical variable.
  • Interpret the coefficient of a term involving a dummy variable in a model.

Linear Regression Models

As a reminder, linear regression models are of the form \(\mathbb{E}\left[y\right] = \beta_0 + \beta_1\cdot x_1 + \beta_2\cdot x_2 +\cdots + \beta_k\cdot x_k\). This works just fine when \(x_1,~x_2,~\cdots,~x_k\) are numerical variables – we can certainly multiply and add numbers together. What happens, however, when we want to use categorical variables in a model? What might the meaning of \(2\cdot\left(\text{gentoo}\right)\) be?
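
R itself makes the point if you try: arithmetic isn't defined for factors, so multiplying a species value by a number only produces a warning and a missing value.

2 * penguins_train$species[1]   # R warns that '*' is not meaningful for factors and returns NA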

Motivating Use of Categorical Variables

Categorical variables can be really valuable predictors, and leaving them out of a model can leave lots of predictive and explanatory power on the table. Let's look at a plot of penguin body mass by species to see just how much value species carries as a predictor.

penguins_train %>%
  ggplot() + 
  geom_boxplot(aes(x = species, y = body_mass_g)) +
  labs(title = "Body Mass by Species",
       x = "",
       y = "Body Mass (g)")
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Visually, we see that the Gentoo penguins are much more massive than the Adelie and Chinstrap penguins.
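
The same story shows up numerically. A quick grouped summary of the training data backs up the plot (na.rm = TRUE skips the two penguins whose body mass is missing):

penguins_train %>%
  group_by(species) %>%
  summarize(
    mean_mass_g = mean(body_mass_g, na.rm = TRUE),  # average body mass for each species
    n = n()                                         # number of training penguins of each species
  )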

Categorical Variables and Dummy Variables

Categories can’t be plugged into linear regression models directly, because linear regression models rely on the operations of multiplication and addition to predict a response. For this reason, we’ll need a way to convert categorical variables into a numerical alternative. There are lots of ways this can be done, but the simplest is to create a corresponding set of dummy variables. These dummy variables are like light switches – they can be turned on or off, and at most one light switch can be turned on at a time. See the example below:

penguins_train %>%
  head() %>%
  select(species) %>%
  mutate(species_gentoo = ifelse(species == "Gentoo", 1, 0),
         species_adelie = ifelse(species == "Adelie", 1, 0),
         species_chinstrap = ifelse(species == "Chinstrap", 1, 0)) %>%
  kable() %>%
  kable_styling()
species species_gentoo species_adelie species_chinstrap
Gentoo 1 0 0
Adelie 0 1 0
Gentoo 1 0 0
Chinstrap 0 0 1
Adelie 0 1 0
Chinstrap 0 0 1

Notice that the species_gentoo column indicates via a \(1\) that the penguin is a Gentoo, or uses a \(0\) to indicate that the penguin is not a Gentoo. The other columns are similar. Because no penguin can be of more than one species, and each penguin is exactly one of Gentoo, Adelie, or Chinstrap, this encoding does exactly what we want. We can plug \(0\)s and \(1\)s into our regression model. Furthermore, we don’t need all of these new dummy variables, since knowing that a penguin is not a Gentoo and is not an Adelie guarantees that the penguin is a Chinstrap. We can see this below.

penguins_train %>%
  head() %>%
  select(species) %>%
  mutate(species_gentoo = ifelse(species == "Gentoo", 1, 0),
         species_chinstrap = ifelse(species == "Chinstrap", 1, 0)) %>%
  kable() %>%
  kable_styling()
species species_gentoo species_chinstrap
Gentoo 1 0
Adelie 0 0
Gentoo 1 0
Chinstrap 0 1
Adelie 0 0
Chinstrap 0 1

I’ve removed the Adelie category, but the idea is the same. That penguin in the second row is not a Gentoo and is not a Chinstrap, so it must be an Adelie. We don’t need the original species column to recover that information.

I removed the species_adelie dummy variable because the Adelie level comes first alphabetically. This is the same choice R will make unless we tell it to do otherwise; we'll highlight reasons you might make other choices later in this notebook. Before we move on to creating dummy variables the tidymodels way instead of this manual method, here is how you could pick a different base level if you ever wanted one.
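
Both lm() and the tidymodels tooling treat the first level of a factor as the base level, so changing the base level is just a matter of reordering the factor levels. A minimal sketch using forcats::fct_relevel(), which loads with the tidyverse:

penguins_train %>%
  mutate(species = fct_relevel(species, "Gentoo")) %>%  # move Gentoo to the front of the level order
  pull(species) %>%
  levels()

With Gentoo first, a later dummy encoding would create species_Adelie and species_Chinstrap columns instead, and Gentoo would become the base level.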

Building a Model With Categorical Variables

The way we build models that include categorical variables is exactly the same way we built simple and multiple linear regression models in the previous notebooks. The only change is that we'll add a step_*() to our recipe(). The tidymodels framework includes lots of step_*() functions which allow us to transform our predictors. The step_dummy() function allows us to create dummy variables in place of a categorical predictor, and it is the first of the step_*() functions we'll encounter.
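
To see what step_dummy() produces on its own, here is a minimal standalone recipe. This is just a sketch for inspection; in practice the workflow runs the prep() and bake() steps for us behind the scenes.

recipe(body_mass_g ~ species, data = penguins_train) %>%
  step_dummy(species) %>%    # replace species with dummy variable columns
  prep() %>%                 # estimate the step from the training data
  bake(new_data = NULL) %>%  # apply it to the training data
  head()

The species column is replaced by species_Chinstrap and species_Gentoo columns, with Adelie left as the base level.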

Fitting A Model With A Categorical Predictor

In our last notebook, we ended with a model that included only flipper_length_mm and year as significant predictors of body_mass_g. Let’s go back to the “full” model, including all of the numerical predictors as well as the species variable as a starting point.

mass_reg_spec <- linear_reg() %>%
  set_engine("lm")

mass_reg_rec <- recipe(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + year + species, data = penguins_train) %>%
  step_dummy(species)

mass_reg_wf <- workflow() %>%
  add_model(mass_reg_spec) %>%
  add_recipe(mass_reg_rec)

mass_reg_fit <- mass_reg_wf %>%
  fit(penguins_train)

There it is – just like before! The only difference is the additional step_dummy() in our recipe().
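
If you ever want to confirm what the recipe did inside the workflow, you can pull the prepped recipe back out of the fitted object and bake it. This is just a sanity check, not a required step.

mass_reg_fit %>%
  extract_recipe() %>%                 # the prepped recipe stored in the fitted workflow
  bake(new_data = penguins_train) %>%  # apply it to the training data
  select(starts_with("species")) %>%
  head()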

Assessing Our Fitted Model

Now that we have our fitted model, let’s assess it and reduce the model if necessary.

mass_reg_fit %>%
  glance() %>%
  kable() %>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.850892 0.847299 320.6405 236.8218 0 6 -1836.902 3689.803 3718.165 25599765 249 256

The p.value for our global test of model utility is very small, indicating that at least one of the variables in our model is a statistically significant predictor of body_mass_g. Now let’s move to a term-by-term analysis.

mass_reg_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  kable() %>%
  kable_styling()
term estimate std.error statistic p.value
(Intercept) 117741.34928 52923.855404 2.224731 0.0269960
bill_length_mm 45.68811 8.453182 5.404841 0.0000002
bill_depth_mm 116.33674 22.899361 5.080349 0.0000007
flipper_length_mm 24.73435 3.853005 6.419495 0.0000000
year -61.06663 26.441028 -2.309541 0.0217324
species_Chinstrap -611.87027 98.338482 -6.222084 0.0000000
species_Gentoo 674.46469 174.809962 3.858274 0.0001455

When accounting for different baseline body mass per species, all of the other available features are statistically significant! This highlights an important idea – just because a predictor was not statistically significant in one model doesn't mean it is unrelated to the response variable or useless for making predictions; its apparent significance can change once other variables (here, species) are accounted for. A visual representation of confidence intervals for our fitted model coefficients appears below.

mass_reg_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  ggplot() +
  geom_errorbarh(aes(xmin = estimate - 2*std.error, 
                     xmax = estimate + 2*std.error,
                     y = term,
                     color = term),
                 show.legend = FALSE) + 
  geom_point(aes(x = estimate,
                 y = term,
                 color = term),
             show.legend = FALSE) +
  geom_vline(xintercept = 0,
             linetype = "dashed",
             lwd = 1.5) +
  labs(title = "Model Coefficients",
       x = "Coefficient Range",
       y = "")
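
The error bars above use a rough estimate plus or minus two standard errors. If you want exact confidence intervals instead, tidy() can compute them directly; here is a quick sketch:

mass_reg_fit %>%
  extract_fit_engine() %>%
  tidy(conf.int = TRUE) %>%  # adds conf.low and conf.high columns (95% intervals by default)
  select(term, estimate, conf.low, conf.high)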

All of our predictors were useful; we just needed to account for the differences between species to see it!

Interpreting Our Model

Since we’ve got our model now and all of the predictors are statistically significant, let’s draw some inferences from it.

mass_reg_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  kable() %>%
  kable_styling()
term estimate std.error statistic p.value
(Intercept) 117741.34928 52923.855404 2.224731 0.0269960
bill_length_mm 45.68811 8.453182 5.404841 0.0000002
bill_depth_mm 116.33674 22.899361 5.080349 0.0000007
flipper_length_mm 24.73435 3.853005 6.419495 0.0000000
year -61.06663 26.441028 -2.309541 0.0217324
species_Chinstrap -611.87027 98.338482 -6.222084 0.0000000
species_Gentoo 674.46469 174.809962 3.858274 0.0001455

We can write our estimated model as follows: \[\begin{align}\mathbb{E}\left[\text{body\_mass\_g}\right] \approx &117741.35 + 45.69\left(\text{bill\_length\_mm}\right) + 116.34\left(\text{bill\_depth\_mm}\right) +\\ &24.73\left(\text{flipper\_length\_mm}\right) - 61.07\left(\text{year}\right) - 611.87\left(\text{species\_Chinstrap}\right) +\\ &674.46\left(\text{species\_Gentoo}\right)\end{align}\]

We can interpret each of these terms just like in our simple or multiple linear regression models. For example, holding all other variables constant, an increase in flipper length of \(1\) millimeter is associated with an expected additional \(24.73\) grams of penguin body mass, on average.
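
We can see that interpretation play out in a prediction. The two hypothetical penguins below are identical except for a one-millimeter difference in flipper length (the measurement values themselves are made up purely for illustration):

# Two made-up Adelie penguins differing only in flipper length
two_penguins <- tibble(
  bill_length_mm = c(39, 39),
  bill_depth_mm = c(18, 18),
  flipper_length_mm = c(190, 191),
  year = c(2008L, 2008L),
  species = factor(c("Adelie", "Adelie"), levels = levels(penguins_train$species))
)

mass_reg_fit %>%
  predict(two_penguins)  # the two predictions differ by the flipper_length_mm coefficient, about 24.73 grams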

The new thing here is how we interpret the coefficients on the species dummy variables. Notice that the Adelie species is missing from our model. As we discussed earlier, this means that Adelie is the base level for the species variable, so we'll interpret all of the other levels with respect to the Adelie penguins.

  • Since the coefficient on species_Chinstrap is \(-611.87\), we expect an otherwise similar Chinstrap penguin to have a body mass about \(611.87\) grams less than an Adelie, on average.
  • Since the coefficient on species_Gentoo is \(674.46\), we expect an otherwise similar Gentoo penguin to have a body mass about \(674.46\) grams more than an Adelie, on average.
  • From the coefficients we can infer that, holding the other measurements constant, Chinstrap penguins are expected to be the least massive and Gentoos the most massive (the two non-base levels are compared directly below).
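
Comparing the two non-base levels to each other is just a difference of coefficients: an otherwise similar Gentoo is expected to be about \(674.46 - \left(-611.87\right) = 1286.33\) grams more massive than a Chinstrap, on average.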

Making Predictions and Assessing Performance

As a reminder, the Adjusted R Squared (adj.r.squared) and Residual Standard Error (sigma) metrics computed for the overall model are optimistically biased, because they are computed on the training data, the same data the model was fit to, where it has already seen the true values of the response. Let's compute R Squared and the root mean squared error (rmse) on both the training and test sets.

my_metrics <- metric_set(rsq, rmse)

(mass_reg_fit %>%
    augment(penguins_train) %>%
    select(body_mass_g, .pred) %>%
    my_metrics(body_mass_g, .pred) %>%
    mutate(type = "training")
) %>% 
  bind_rows(
    mass_reg_fit %>%
      augment(penguins_test) %>%
      select(body_mass_g, .pred) %>%
      my_metrics(body_mass_g, .pred) %>%
      mutate(type = "test")
  ) %>%
  pivot_wider(id_cols = .metric,
              names_from = type,
              values_from = .estimate) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
.metric training test
rsq 0.850892 0.8428805
rmse 316.226315 297.3192546

We can see that the R Squared metric is similar across the training and test sets. As we noted in the multiple linear regression notebook, the test RMSE is slightly lower than the training RMSE; this is likely because our particular test set happens to contain mostly typical penguins, with fewer penguins whose body mass is difficult to predict.

Note, however, that this new model is an enormous improvement over our previous models. In terms of R Squared, our model now explains around 85% of the variation in penguin body mass. In terms of predictive value, our model is now expected to make predictions accurate to within about \(\pm 595\) to \(\pm 632\) grams, roughly twice the test and training RMSE values, respectively. These are big improvements over our multiple linear regression model, which did not account for penguin species.


Summary

With respect to coding, the only new thing we introduced in this notebook was the use of step_dummy() in conjunction with a recipe(). We can use step_dummy() with any categorical variable in a model. It creates a set of dummy variables: numeric indicator columns, one for each level of the categorical variable except one. The level that doesn't get a dummy variable (by default, the first level) is the base level, and all of the model coefficients attached to dummy variables are interpreted with respect to it.
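
One convenience worth knowing: if a model includes several categorical predictors, you don't need to list them individually inside step_dummy(), because recipes provides selector functions for this. The sketch below (the formula is just a placeholder) dummy-encodes every nominal predictor at once.

# Dummy-encode every categorical (nominal) predictor in the model in one step
recipe(body_mass_g ~ ., data = penguins_train) %>%
  step_dummy(all_nominal_predictors())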

Now you can use both numerical and categorical predictors in your regression models!