Purpose: In this notebook, we introduce the class of models known as logistic regression. In particular, we see why linear regression is poorly suited to classification, examine two equivalent forms of the logistic regression model, interpret its coefficients with help from the {marginaleffects} package, and see how to implement it in {tidymodels}.

Toy Data

In order to have an example to work with, we’ll develop a toy dataset. Let’s say that we are able to measure the weights of frogs from two species, and that the weights are normally distributed in each population. The weights in Species A follow \(W_A \sim N\left(\mu = 40, \sigma = 3\right)\) and the weights in Species B follow \(W_B \sim N\left(\mu = 55, \sigma = 7\right)\).

We’ll simulate drawing \(100\) frogs from each population and recording their weights. The results appear below. Note that in the plot on the left, the vertical position of the observed data points is meaningless – some noise has been added so that the observed frog weights are discernible from one another.
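
As a concrete sketch of how data like this could be simulated (the object and column names frogs, weight, and species are assumptions used for illustration in the examples that follow):

library(tidyverse)

# Simulate 100 frog weights from each species
set.seed(123)
frogs <- tibble(
  weight = c(rnorm(100, mean = 40, sd = 3),   # Species A: N(mu = 40, sigma = 3)
             rnorm(100, mean = 55, sd = 7)),  # Species B: N(mu = 55, sigma = 7)
  species = factor(rep(c("A", "B"), each = 100))
)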

Our goal then is to fit a model that will help us determine whether a frog with a given weight is most likely to belong to Species A or Species B.

Why Not Linear Regression?

Linear regression techniques are problematic here because of the nature of polynomial functions. Aside from constant functions, polynomials are unbounded – eventually, any non-constant polynomial escapes the interval \(\left[0, 1\right]\), and its predicted values run off toward positive or negative infinity. You can see this below, where we fit and plot several linear regression models on the simulated frog weight data.
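
To make the problem concrete, here is a minimal sketch (using the hypothetical frogs data from above and a 0/1 indicator for Species B) showing how even a simple linear fit produces predictions outside \(\left[0, 1\right]\):

# Recode the response as 0/1 so that lm() can be used: 1 = Species B
frogs_numeric <- frogs %>%
  mutate(is_b = as.numeric(species == "B"))

# A degree-1 linear regression of the 0/1 indicator on weight
lin_fit <- lm(is_b ~ weight, data = frogs_numeric)

# For sufficiently extreme weights, the predictions escape [0, 1]
predict(lin_fit, newdata = tibble(weight = c(20, 80)))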

Note that these polynomial-form linear regression models all eventually escape the \(\left[0, 1\right]\) interval. The outputs from these models cannot be interpreted as probabilities.

Logistic Regression

A logistic regression model takes the form \(\displaystyle{\mathbb{P}\left[y = 1\right] = \frac{e^{\beta_0 + \beta_1x_1 + \cdots + \beta_kx_k}}{1 + e^{\beta_0 + \beta_1x_1 + \cdots + \beta_kx_k}}}\). The outputs from such a model are bounded to the interval \(\left[0, 1\right]\). As a result, we can interpret the outputs from such a model as the probability that an observation belongs to the class labeled by 1.

Below we fit and plot a logistic regression model.
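
As a sketch of one way to do this (using base R’s glm() and the hypothetical objects defined earlier), we might fit the model and plot its fitted probability curve like so:

# Fit a logistic regression of the Species B indicator on weight
log_fit <- glm(is_b ~ weight, data = frogs_numeric, family = binomial)

# Plot the fitted probability curve over a grid of weights
weight_grid <- tibble(weight = seq(20, 80, by = 0.5))
weight_grid %>%
  mutate(prob_b = predict(log_fit, newdata = weight_grid, type = "response")) %>%
  ggplot(aes(x = weight, y = prob_b)) +
  geom_line() +
  labs(y = "Estimated probability of Species B")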

Properties

Logistic regression models are linear models since we can show that an equivalent form for the model is \[\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1x_1 + \cdots + \beta_kx_k\]
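
To see why, write \(p = \mathbb{P}\left[y = 1\right]\) and solve the first form for the linear predictor: \[p = \frac{e^{\beta_0 + \beta_1x_1 + \cdots + \beta_kx_k}}{1 + e^{\beta_0 + \beta_1x_1 + \cdots + \beta_kx_k}} \implies \frac{p}{1 - p} = e^{\beta_0 + \beta_1x_1 + \cdots + \beta_kx_k} \implies \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1x_1 + \cdots + \beta_kx_k\]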

This means that the decision boundary for a logistic regression model corresponds to the situation where \(p = 1 - p\) (that is, \(p = 0.5\)). Since \(p = 0.5\) makes the log-odds \(\log\left(\frac{p}{1 - p}\right) = \log\left(1\right) = 0\), the decision boundary satisfies \(\beta_0 + \beta_1x_1 + \cdots + \beta_kx_k = 0\), the equation of a line/plane/hyperplane.

Interpretations and Marginal Effects

Since the linear form of a logistic regression model outputs the log-odds of belonging to the class labeled by 1 and not the probability of belonging to that class, it can be difficult to interpret logistic regression models. In particular,

  • If \(x_i\) is a numeric predictor, then \(\beta_i\) is the expected change in the log-odds of an observation belonging to the class labeled by 1 associated with a unit increase in \(x_i\), holding the other predictors fixed.

    • Note that this is not the same as the effect on the probability of an observation belonging to the class labeled by 1.
  • In order to obtain the expected effect of a unit increase in \(x_i\) on the probability of an observation belonging to the class labeled by 1, we would need to compute the partial derivative of the original form of our logistic regression model with respect to \(x_i\). Multivariable calculus helps with this, but so does the {marginaleffects} package.
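
The coefficient table below summarizes a logistic regression of species on weight for the frog data. As a rough sketch (assuming the glm fit named log_fit from earlier), a table like it can be produced with {broom}:

library(broom)

# Coefficient estimates, standard errors, and test statistics on the
# log-odds scale
tidy(log_fit)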

term         estimate      std.error   statistic   p.value
(Intercept)  -23.2702554   3.5205587   -6.609819   0
weight         0.5175735   0.0806284    6.419245   0

The estimate attached to weight in the table above is the estimated increase in the log-odds of belonging to Species B associated with a one-unit increase in weight. This is difficult to interpret beyond saying that heavier frogs are more likely to belong to Species B, because the coefficient on weight is positive. We can use the slopes() function from the {marginaleffects} package to compute the partial derivative of the logistic regression model with respect to weight at various levels of the weight variable.
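
A sketch of such a call (the model object name and the grid of weights are assumptions for illustration) might look like the following, with output resembling the table below:

library(marginaleffects)

# Partial derivative of the predicted probability with respect to weight,
# evaluated over a grid of representative weights
slopes(
  log_fit,
  variables = "weight",
  newdata = datagrid(weight = seq(40, 50, length.out = 10))
)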

term     weight     estimate    conf.low    conf.high   std.error
weight   40.25424   0.0383405   0.0196259   0.0570551   0.0095485
weight   41.27119   0.0582201   0.0337657   0.0826745   0.0124770
weight   42.28814   0.0830275   0.0509088   0.1151462   0.0163874
weight   43.30508   0.1083282   0.0684017   0.1482547   0.0203711
weight   44.32203   0.1259488   0.0837024   0.1681952   0.0215547
weight   45.33898   0.1281450   0.0924544   0.1638355   0.0182098
weight   46.35593   0.1138380   0.0860349   0.1416411   0.0141855
weight   47.37288   0.0896055   0.0614667   0.1177444   0.0143568
weight   48.38983   0.0640863   0.0341911   0.0939814   0.0152529
weight   49.40678   0.0427788   0.0151700   0.0703875   0.0140863

The output above shows estimates for the marginal effect of a unit increase in weight at several representative weights. We can see that the marginal effect is largest near a weight of about 45 units. For weights below about 35 units or above 55 units, a unit increase in weight results in little change in the probability that the corresponding observation belongs to Species B. Plots of these marginal effects can be really helpful in understanding what our model suggests about the association between our predictor(s) and the response.
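
One way to build such a plot (a sketch, assuming the slopes() output over a finer grid is stored in an object called weight_slopes) is with {ggplot2}:

# Marginal effect of weight across a fine grid, with a confidence band
weight_slopes <- slopes(
  log_fit,
  variables = "weight",
  newdata = datagrid(weight = seq(30, 65, length.out = 100))
)

ggplot(weight_slopes, aes(x = weight, y = estimate)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2) +
  geom_line() +
  labs(x = "weight", y = "Marginal effect on probability of Species B")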

How to Implement in {tidymodels}

A logistic regressor is a model class (that is, a model specification). We define our intention to build a logistic regressor using

log_reg_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
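
Once the specification is defined, fitting and predicting follow the usual {tidymodels} pattern. A small sketch using the hypothetical frogs data:

library(tidymodels)

# Fit the specification to the frog data
log_reg_fit <- log_reg_spec %>%
  fit(species ~ weight, data = frogs)

# Predicted class probabilities for a few new frogs
log_reg_fit %>%
  predict(new_data = tibble(weight = c(42, 50, 60)), type = "prob")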

Hyperparameters and Other Extras

The calls to set_engine() and set_mode() are unnecessary here, since "glm" and "classification" are the defaults. I include them so that we continue to be reminded that setting an engine and setting a mode are steps we will typically need to take.

Depending on the fitting engine chosen, logistic regression models have two (2) tunable hyperparameters. They are

  • penalty, which controls the strength of regularization – a constraint that penalizes large coefficients and the inclusion of unnecessary model terms.

    • Typical penalty values are powers of 10 – for example 1e-3, 0.01, 0.1, 1, 10.
    • Remember to scale all of your numerical predictors if you use this parameter. Otherwise some predictors are artificially cheap or expensive depending on the magnitude of their raw values.
  • mixture is a number between \(0\) and \(1\) which determines the type of regularization to use. Basically, this governs how our “spent coefficient budget” is computed (a sketch of setting both hyperparameters appears after this list).

    • mixture = 0 results in L2 regularization (Ridge Regression)
    • mixture = 1 results in L1 regularization (LASSO)
    • mixture values between \(0\) and \(1\) are a mixture of Ridge and LASSO, where the value provided represents the proportion of the budget calculation corresponding to the LASSO.
    • As a reminder, you can read more about Ridge Regression, the LASSO, and how these approaches work here.
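
As a sketch of how these hyperparameters might be supplied (this assumes the "glmnet" engine, since the default "glm" engine does not use penalty or mixture, and assumes {tidymodels} is loaded as above), we could write either of the following:

# Elastic-net logistic regression with fixed hyperparameter values
penalized_log_reg_spec <- logistic_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# Or mark the hyperparameters for tuning; remember to normalize numeric
# predictors (for example, with step_normalize() in a recipe) when using penalty
tunable_log_reg_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")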

You can see the full {parsnip} documentation for logistic_reg() here.


Summary

In this notebook, we were introduced to the class of logistic regression models for classification. Logistic regressors are a sort of hybrid regression/classification model because they output a numerical value, interpreted as the probability that an observation belongs to the class labeled by 1. We saw two equivalent forms for logistic regression models and how to interpret them. In particular, we saw how the {marginaleffects} package helps us understand what our model tells us about the association between our predictor(s) and our response.