October 29, 2024
We’ve been hypothesizing, building, assessing, and interpreting regression models
\[\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\]
for the past few weeks
We began with simple linear regression and multiple linear regression, where the predictors \(x_1, \cdots, x_k\) were numerical variables.
We considered strategies for adding categorical predictors to our models
We expanded our ability to fit complex relationships by considering higher-order terms
Each time we expanded our modeling abilities, we introduced additional \(\beta\)-parameters to our models
This gave us several advantages:
And some disadvantages:
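To make the \(\beta\)-counting concrete, here is a minimal sketch in R with made-up variable names (two numeric predictors and a three-level categorical predictor); each expanded formula carries more \(\beta\)-coefficients than the last:

```r
# Hypothetical data frame with two numeric predictors (x1, x2)
# and one three-level categorical predictor (grp)
set.seed(1)
dat <- data.frame(
  x1  = rnorm(60),
  x2  = rnorm(60),
  grp = factor(sample(c("a", "b", "c"), 60, replace = TRUE))
)
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(60)

length(coef(lm(y ~ x1, data = dat)))                      # 2 betas: simple linear regression
length(coef(lm(y ~ x1 + x2, data = dat)))                 # 3 betas: multiple linear regression
length(coef(lm(y ~ x1 + x2 + grp, data = dat)))           # 5 betas: two dummy variables added
length(coef(lm(y ~ poly(x1, 2) + x2 + grp, data = dat)))  # 6 betas: a quadratic term added
```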
There’s a question we’ve been neglecting though…
Does improved fit to our training data actually translate to a better (more accurate, more meaningful) model in practice?
Increasing the number of \(\beta\)-parameters in a model (by including dummy variables, higher-order terms, etc.) increases the flexibility of our model
This means that our model can accommodate more complex relationships – whether they are signal or noise
This also means that our model becomes more sensitive to its training data
More \(\beta\) coefficients generally mean better training performance (higher \(R^2\), lower RMSE, etc.)
The more \(\beta\) coefficients we have, the greater the likelihood that we are overfitting to our training data
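Here is a minimal sketch in R of that first point, using a simulated dataset whose true relationship is only linear; the training \(R^2\) still climbs (or at least never falls) as we pile on higher-order terms:

```r
# Simulated data whose true relationship between x and y is only linear
set.seed(300)
sim_data <- data.frame(x = runif(50, 0, 10))
sim_data$y <- 3 + 2 * sim_data$x + rnorm(50, sd = 4)

# Training R-squared for polynomial fits of increasing degree
train_rsq <- sapply(1:8, function(d) {
  summary(lm(y ~ poly(x, d), data = sim_data))$r.squared
})

round(train_rsq, 4)
# The training R-squared never decreases as higher-order terms are added,
# even though every term beyond the first is only chasing noise
```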
Question 1: If all those great strategies we’ve learned recently are increasing our risk of overfitting, should we really be using them?
Question 2: How do we know whether we are overfitting or not?
Question 3: Can we know?
We’ll look at the following items as we try to answer those three questions.
The level of bias in a model measures how conservative that model is: how much systematic error comes from the simplifying assumptions it imposes on the relationship between the predictors and the response
The level of variance in a model is a measure of how much that model would change, given different training data from the same population
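One way to see model variance directly is to draw several training sets from the same (made-up) population, refit a rigid model and a flexible model on each, and measure how much their predictions move around. A hedged sketch in R:

```r
set.seed(451)
grid <- data.frame(x = seq(0, 10, by = 0.1))

# Predictions from each model, refit on 20 different training sets
# drawn from the same (made-up) population
preds <- replicate(20, {
  train <- data.frame(x = runif(40, 0, 10))
  train$y <- sin(train$x) + rnorm(40, sd = 0.5)

  rigid    <- lm(y ~ x, data = train)            # low flexibility
  flexible <- lm(y ~ poly(x, 8), data = train)   # high flexibility

  c(predict(rigid, newdata = grid), predict(flexible, newdata = grid))
})

# Average spread of predictions across training sets at each grid point:
# the flexible model's curve moves around far more -- that is high model variance
n <- nrow(grid)
c(rigid    = mean(apply(preds[1:n, ], 1, sd)),
  flexible = mean(apply(preds[(n + 1):(2 * n), ], 1, sd)))
```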
Let’s bring back our Linear and Eighth Degree models from the opening slide
High Bias / Low Variance:
Low Bias / High Variance:
Let’s consider what would happen if we had a new training observation at \(x = 7\), \(y = 5\)
Oh No!
On the previous slide, we saw very clearly that the model on the right had bias that was too low and variance that was too high
It is also possible for a model to have bias which is too high and variance which is too low
Consider the scenario below
Generally, models with high bias have low variance, and vice-versa.
This is the bias/variance trade-off
Identifying an appropriate level of model flexibility means solving the bias/variance trade-off problem!
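That trade-off can be stated precisely. For a prediction \(\hat{f}(x)\) of a new response \(y\) at a point \(x\), with \(\sigma^2\) denoting the irreducible noise in \(y\), the standard decomposition of expected squared prediction error is

\[\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right] = \text{Bias}\left[\hat{f}(x)\right]^2 + \text{Var}\left[\hat{f}(x)\right] + \sigma^2\]

Adding flexibility shrinks the bias term but inflates the variance term (and removing flexibility does the reverse), so total error is smallest at some intermediate level of flexibility.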
Let’s head back to the linear and eighth-degree models from the opening slide
Now that we’ve fit these models, let’s see how they perform on new data drawn from the same population
Linear model: Underfit! Bias too high, variance too low; not flexible enough
Eighth-degree model: Overfit! Bias too low, variance too high; too flexible
Let’s start with a new data set, but I won’t tell you the degree of the association between \(x\) and \(y\)
We’ll fit a variety of models
And then we’ll measure their performance
Here they are…
Let’s examine the training metrics
| model | degree | rsq | rmse |
|---|---|---|---|
| straight-line | 1 | 0.3729797 | 274.3429 |
| quadratic | 2 | 0.4081391 | 266.5402 |
| cubic | 3 | 0.7299959 | 180.0272 |
| 5th-order | 5 | 0.7304174 | 179.8866 |
| 11th-order | 11 | 0.7404520 | 176.5070 |
Performance gets better as flexibility increases!
Let’s do the same with the test metrics
| model | degree | rsq | rmse |
|---|---|---|---|
| straight-line | 1 | 0.3045938 | 296.2297 |
| quadratic | 2 | 0.2931054 | 291.3515 |
| cubic | 3 | 0.6266876 | 189.1693 |
| 5th-order | 5 | 0.6205745 | 190.5977 |
| 11th-order | 11 | 0.5915634 | 196.9832 |
The training and test RMSE values largely agree over the lowest three levels of model flexibility, but…
Test performance gets worse with additional flexibility beyond third degree!
\(\bigstar \bigstar\) Main Takeaway \(\bigstar \bigstar\) The computer’s job is to find the best \(\beta\)-coefficients by minimizing training error, but our job (as modelers) is to find the best model by minimizing test error.
Here’s the code I used to generate our toy dataset…
That’s a third-degree association!
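Here is a minimal sketch in R of that whole workflow, from simulating data with a third-degree association through the training and test metrics; the coefficients, noise level, sample sizes, and seed below are placeholders rather than the values actually used, so only the qualitative pattern should be expected to match the tables above:

```r
set.seed(300)

# Placeholder data with a third-degree association plus noise
# (coefficients, noise level, and sample size are made up)
toy <- data.frame(x = runif(200, -10, 10))
toy$y <- 40 + 5 * toy$x - 2 * toy$x^2 + 1.5 * toy$x^3 + rnorm(200, sd = 150)

# Hold out a test set drawn from the same population
idx   <- sample(nrow(toy), size = 150)
train <- toy[idx, ]
test  <- toy[-idx, ]

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rsq  <- function(obs, pred) cor(obs, pred)^2

degrees <- c(1, 2, 3, 5, 11)
results <- data.frame(degree = degrees, train_rmse = NA_real_,
                      test_rmse = NA_real_, test_rsq = NA_real_)

for (i in seq_along(degrees)) {
  fit <- lm(y ~ poly(x, degrees[i]), data = train)
  results$train_rmse[i] <- rmse(train$y, predict(fit, newdata = train))
  results$test_rmse[i]  <- rmse(test$y,  predict(fit, newdata = test))
  results$test_rsq[i]   <- rsq(test$y,   predict(fit, newdata = test))
}

results
# Training RMSE keeps improving as the degree grows, but test RMSE (and test
# R-squared) is best near degree 3 and then worsens -- the same qualitative
# pattern as the tables above
```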
Solving the Bias/Variance Trade-Off Problem:
We can identify the appropriate level of model flexibility by finding the location of the bend in the elbow plot of test performance
Bias and variance are two competing properties of a model
Model variance refers to how much our model may change if provided a different training set from the same population
Models with high variance are more flexible and are more likely to overfit
Models with low variance are less flexible and are more likely to underfit
A model is overfit if it has learned too much about its training data and the model performance doesn’t generalize to unseen or new data
A model is underfit if it is not flexible enough to capture the general trend between predictors and response
We solve the bias/variance trade-off problem by finding the level of flexibility at which test performance is best (the bend in the elbow plot)
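As a quick sketch of that elbow plot in R, using the test RMSE values from the table above:

```r
# Test RMSE by polynomial degree, taken from the test-metrics table above
elbow <- data.frame(
  degree    = c(1, 2, 3, 5, 11),
  test_rmse = c(296.2297, 291.3515, 189.1693, 190.5977, 196.9832)
)

plot(elbow$degree, elbow$test_rmse, type = "b", pch = 19,
     xlab = "Polynomial degree (flexibility)",
     ylab = "Test RMSE",
     main = "Elbow plot of test performance")

# The bend sits at degree 3: test error drops sharply up to the cubic model,
# then flattens out and creeps back up, so the cubic model is the one to choose
```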
Remember that greater levels of flexibility are associated with higher model variance and a greater risk of overfitting
If the improvements in model performance are small enough that they don’t outweigh these risks, we should choose the simpler (more parsimonious) model even if it doesn’t have the absolute best performance on the unseen test data
Cross-Validation