Variable Selection Methods: Regularization with Ridge and LASSO
Published
August 17, 2024
Recap
With Cross-Validation, we’ve supercharged our modeling powers. We’ve made our model performance estimates much more reliable, which gives us greater confidence in understanding the predictive power (or lack thereof) of our models. We’ve become much more responsible modelers, in this sense. We’ll see again soon that cross-validation can actually do much more for us than just provide more reliable performance estimates than the validation-set approach did. For now though, we shift focus slightly to the question: how do we know which predictors should be included in our model?
Motivation
When we first introduced multiple linear regression, we took an approach in which we built a “large” model – a model that included most/all of our available predictors. Once we had that model, we used a procedure called backward elimination to eliminate terms from our model if they were associated with \(p\)-values above the 5% significance threshold. We could have taken the opposite approach, starting with an empty model, adding predictors/terms one at a time – a process called forward selection.
With both of these methods, we’re allowing our model to be quite greedy. It may be helpful to think of these processes as allowing our model to go “shopping” for predictors – in the backward selection paradigm, the model begins by adding every item in the store to its shopping cart, and then carefully places the ones it doesn’t want back on the shelves – in the forward selection paradigm, the model begins with an empty cart and adds its favorite items one-by-one into its shopping cart. This seems quite reasonable at first, but we’re engaging in quite risky behavior.
From a purely statistical standpoint, we’re evaluating lots of \(t\)-tests in determining whether model terms are statistically significant or not. The likelihood of making at least one Type I Error (saying that a model term is statistically significant, when it is not so in the population) becomes very large in these processes.
From a model-fit perspective, the more predictor variables a model has, the more flexible it is – the more flexible a model is, the better it will fit the training data and the more likely it is to become overfit.
By allowing our model to “shop” freely for its predictors, we are enticing our model to overfit the training data – giving our model a “budget” to spend on its shopping trip would be a really nice way to lower the likelihood that our model “buys” too many predictors and overfits the training data.
Optimization Procedures for Model Fitting
All of the models we’ve fit so far in our course use Ordinary Least Squares as the underlying fitting procedure – we are moving to new waters now. Ridge Regression and LASSO models belong to a class of models called Generalized Linear Models. Although their fitting procedures under the hood are different, the consistency of the {tidymodels} framework allows us to make this change quite simply.
Regularization
The process of applying an additional constraint, such as a budget, is one example of a class of techniques commonly referred to as regularization. Regularization techniques are used with many model classes to help prevent those models from overfitting our training data.
Objectives
After working through this notebook you should be able to:
Articulate the backward selection approach and how it relates to the multiple linear regression models we’ve constructed and analyzed so far in our course.
Discuss forward selection as an alternative to the backward selection process.
Discuss and implement Ridge Regression and the LASSO as alternatives to Ordinary Least Squares within the {tidymodels} framework.
Discuss the benefits and drawbacks for forward- and backward-selection, Ridge Regression, and the LASSO relative to one another.
Ordinary Least Squares
The process of fitting a model using ordinary least squares is an optimization problem. Simply put, to fit a model of the form
\[\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\] we minimize the residual sum of squares over the training data:

\[\displaystyle{\sum_{\text{training data}}{\left(y_{\text{observed}} - \left(\beta_0 + \sum_{i = 1}^{k}{\beta_i x_i}\right)\right)^2}}\]
This process of choosing values for \(\beta_0,~\beta_1,~\beta_2,~\cdots,~\beta_k\) is called Ordinary Least Squares (OLS), and it’s what we’ve been using (or, rather R has been using) all semester long to fit our models.
In OLS, the optimization procedure can freely choose those \(\beta\) values, without any constraint (OLS is shopping without a budget). Ridge Regression and the LASSO introduce extra constraints on this optimization problem.
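As a quick point of reference, here is a minimal sketch of the OLS fitting we’ve been doing all along, written with {tidymodels}. The data frame my_data and the predictors x1 and x2 are placeholders, not objects from our actual analysis.

```r
library(tidymodels)

# Ordinary least squares via the "lm" engine (the default for linear_reg())
ols_spec <- linear_reg() %>%
  set_engine("lm")

# `my_data` is a hypothetical data frame with response y and predictors x1, x2
ols_fit <- ols_spec %>%
  fit(y ~ x1 + x2, data = my_data)

tidy(ols_fit)  # the beta estimates chosen by minimizing the residual sum of squares
```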
Ridge Regression
The process of fitting a model using Ridge Regression is similar to that of using OLS, except that a constraint on the chosen \(\beta\) coefficients is imposed. That is, using Ridge Regression to fit a model of the form \[\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\] solves the following optimization problem: \[\begin{array}{ll}\text{Minimize:} & \displaystyle{\sum_{\text{training data}}{\left(y_{\text{observed}} - \left(\beta_0 + \sum_{i = 1}^{k}{\beta_i x_i}\right)\right)^2}}\\
\text{Subject To:} & \displaystyle{\sum_{i=1}^{k}{\beta_i^2} \leq C}\end{array}\]
where \(C\) can be thought of as a total coefficient budget. That is, it becomes very expensive for the model to assign large coefficients to many of its predictors.
Least Absolute Shrinkage and Selection Operator (LASSO)
The difference between Ridge Regression and the LASSO is in how “spent budget” is calculated. Instead of summing the squares of the \(\beta\) coefficients, we’ll sum their absolute values. That is, using the LASSO to fit a model of the form \[\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\] solves the following optimization problem: \[\begin{array}{ll}\text{Minimize:} & \displaystyle{\sum_{\text{training data}}{\left(y_{\text{observed}} - \left(\beta_0 + \sum_{i = 1}^{k}{\beta_i x_i}\right)\right)^2}}\\
\text{Subject To:} & \displaystyle{\sum_{i=1}^{k}{\left|\beta_i\right|} \leq C}\end{array}\]
where \(C\) can again be thought of as a total coefficient budget. Because Ridge Regression squares the \(\beta\) coefficients when calculating the “budget used,” it shrinks coefficients toward \(0\) but rarely sets any of them exactly to \(0\). The LASSO behaves differently: due to the mathematics behind its absolute-value constraint, models fit using the LASSO have their smallest coefficients sent exactly to \(0\). That is, the LASSO can be used as a variable selection procedure. Models fit with Ridge Regression, by contrast, often leave several predictors with very small coefficients; while those predictors don’t have much influence on the overall model predictions, they do still remain in the model.
Some Feature PreProcessing Concerns
If we are going to utilize Ridge Regression or the LASSO, we need to ensure that all of our predictors are on a standardized scale. Otherwise, predictors/features whose observed values are large are superficially cheaper for the model to use (because we can attach small coefficients to them) than features whose observed values are already small. Similarly, features whose observed values are very small are artificially made more expensive for the model to use because they demand larger coefficients in order to have influence over model predictions. There are two very common scaling methods:
Min/Max scaling projects a variable onto the interval \(\left[0, 1\right]\), where the minimum observed value is sent to \(0\) and the maximum observed value is sent to \(1\).
Min/Max scaling for the variable \(x\) is done using the formula: \(\displaystyle{\frac{x - \min\left(x\right)}{\max\left(x\right) - \min\left(x\right)}}\).
We can add step_range() to a recipe() to min/max scale a feature.
Standardization converts a variable’s raw measurements into standard deviations (\(z\)-scores), where the mean of the transformed variable is \(0\) and the standard deviation of the transformed variable is \(1\).
Standardized scaling for the variable \(x\) is done using the formula: \(\displaystyle{\frac{x - \text{mean}\left(x\right)}{\text{sd}(x)}}\)
We can add step_normalize() to a recipe() to standardize the values of a feature.
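As a sketch, both scaling steps might be added to a recipe like the ones below. The data frame my_data and the model formula are placeholders; in practice we would apply only one of the two scalings.

```r
library(tidymodels)

# Min/Max scaling of every numeric predictor onto [0, 1]
minmax_rec <- recipe(y ~ ., data = my_data) %>%
  step_range(all_numeric_predictors(), min = 0, max = 1)

# Standardization of every numeric predictor to mean 0 and standard deviation 1
standardize_rec <- recipe(y ~ ., data = my_data) %>%
  step_normalize(all_numeric_predictors())
```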
Fitting a Model Using Ridge Regression or the LASSO
The {tidymodels} framework provides us with a standardized structure for defining and fitting models. Everything we’ve learned about the syntax and workflow for fitting a linear regression model using OLS can be used to fit these Generalized Linear Models. We’ll just need to set_engine() to something other than "lm". Specifically, we need an engine which can fit models using the constraints defined in the Ridge Regression and LASSO sections above. We’ll use "glmnet", which requires two additional arguments: mixture and penalty.
The mixture argument is a number between \(0\) and \(1\). Setting mixture = 1 results in a LASSO model, while setting mixture = 0 results in a Ridge Regression model. Values in between \(0\) and \(1\) result in a mixed LASSO/Ridge approach, where the penalty is partially determined by the Ridge Regression constraint and partially by the LASSO constraint.
The penalty argument corresponds to the amount of regularization being applied. That is, the penalty is related to the budget parameter we’ve described earlier: a larger penalty corresponds to a smaller coefficient budget, and therefore to more aggressive shrinking of the coefficients.
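For example, a pair of model specifications might look like the sketch below. The penalty value of 0.1 is an arbitrary placeholder, not a recommendation.

```r
library(tidymodels)

# mixture = 0 requests Ridge Regression; mixture = 1 requests the LASSO
ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("glmnet")

lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")
```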
Ummm…Great – So What Do I Use and When Do I Use It?
There is some really good news coming – just bear with me!
The choice between Ridge Regression and LASSO depends on your goals. If you are looking for a tool to help with variable selection, then the LASSO is your choice, since it will result in less valuable predictors being assigned coefficients of exactly \(0\). If you are building a predictive model, you might try both and see which one performs better for your specific use-case.
How do I choose a penalty? As I understand it, there’s no great science to choosing a penalty. People will typically try values and see what the best ones are for their particular use-case.
So basically I’ve told you about these two new classes of model that help prevent overfitting. They each require a penalty parameter, and I’ve given you no real guidance on how to choose it. For now, we’ll simply choose a penalty parameter and note that we could build several models with different penalty values to try and find one that performs best. I’ll give you a much better method when we talk about hyperparameter tuning.
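To make the “try several penalty values” idea concrete, here is a rough sketch that compares a handful of candidate penalties by cross-validated RMSE. The recipe my_rec and the cross-validation folds my_folds are assumed placeholder objects (a recipe() and a vfold_cv() result), and the candidate penalty values are arbitrary.

```r
library(tidymodels)

# Candidate penalty values to compare (arbitrary choices)
penalties <- c(0.01, 0.1, 1, 10)

cv_results <- purrr::map_dfr(penalties, function(pen) {
  spec <- linear_reg(penalty = pen, mixture = 1) %>%
    set_engine("glmnet")

  workflow() %>%
    add_recipe(my_rec) %>%
    add_model(spec) %>%
    fit_resamples(resamples = my_folds) %>%
    collect_metrics() %>%
    filter(.metric == "rmse") %>%
    mutate(penalty = pen)
})

cv_results  # compare cross-validated RMSE across the candidate penalties
```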
Implementing Ridge Regression and the LASSO
There are a few additional concerns with fitting Ridge Regression and LASSO models that we haven’t needed to deal with before now.
The glmnet engine requires that no missing values are included in any fold.
We can omit rows with missing data. (drawback: we are throwing away observations and we may be drastically reducing the size of our available data)
We can omit features with missing data. (drawback: we are sacrificing potentially valuable predictors by omitting them from our model)
We can impute missing values using a step_impute_*() feature engineering step in our recipe. (drawback: we are making our best guess at what the missing value should be, but we are introducing additional uncertainty to our model)
Because the penalty parameter imposes a budget on coefficients, all of our predictors must be on the same scale – otherwise, some predictors are artificially cheaper or more expensive than others to include in our model.
We can use step_range() or step_normalize() on all_numeric_predictors() to achieve this.
We’ll filter out the observations with an unknown response (Sale_Price), use step_impute_knn() to apply a \(k\)-nearest neighbors imputation scheme for any missing values in our predictor columns, and finally use step_normalize() to center and scale our numerical predictors. Then we’ll proceed as usual.
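A sketch of that preprocessing appears below. The data frame names (ames and ames_known) and the recipe name ames_rec are placeholders assumed for illustration.

```r
library(tidymodels)

# Keep only the rows where the response is known
ames_known <- ames %>%
  filter(!is.na(Sale_Price))

ames_rec <- recipe(Sale_Price ~ ., data = ames_known) %>%
  step_impute_knn(all_predictors()) %>%       # k-nearest neighbors imputation for missing predictor values
  step_normalize(all_numeric_predictors())    # center and scale the numeric predictors
```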
Implementing Ridge Regression
We’ll build and assess a Ridge Regression model first.
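A sketch of that process is shown below, reusing the assumed ames_rec recipe and ames_known data frame from the preprocessing sketch above; the penalty value remains an arbitrary placeholder.

```r
# Ridge Regression: mixture = 0
ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("glmnet")

ridge_fit <- workflow() %>%
  add_recipe(ames_rec) %>%
  add_model(ridge_spec) %>%
  fit(data = ames_known)

ridge_fit %>%
  extract_fit_parsnip() %>%
  tidy()                    # inspect the estimated coefficients
```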
From the regression output above, we can see some predictors getting large coefficients, while others receive coefficients on a smaller scale.
Implementing the LASSO
The work required to construct a LASSO model is nearly identical. We simply use mixture = 1 in the model specification to signal that we want to use the LASSO constraint instead of the ridge constraint.
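A corresponding sketch, under the same assumed objects as the Ridge example, would look like the following.

```r
# LASSO: mixture = 1
lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

lasso_fit <- workflow() %>%
  add_recipe(ames_rec) %>%
  add_model(lasso_spec) %>%
  fit(data = ames_known)

lasso_fit %>%
  extract_fit_parsnip() %>%
  tidy()                    # many coefficients should appear as exactly 0
```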
We can see in the LASSO model that many of the predictors were assigned coefficients of \(0\). The LASSO procedure identified that those predictors weren’t worth “buying”, so it left them out and only included the most useful predictors!
Summary
In this notebook, we introduced two new regression models: Ridge Regression and the LASSO. These models are of the familiar form \(\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\), but are fit using a constrained optimization procedure rather than ordinary least squares. When we implement Ridge and the LASSO, we are reducing the likelihood that the resulting model is overfit. With these models, however, we must remember that all numerical predictors need to be scaled and no missing data can be present in the training data.