Variable Selection Methods: Regularization with Ridge and LASSO

Published

August 17, 2024

Recap

With Cross-Validation, we’ve supercharged our modeling powers. We’ve made our model performance estimates much more reliable, which gives us greater confidence in understanding the predictive power (or lack thereof) of our models. We’ve become much more responsible modelers, in this sense. We’ll see again soon that cross-validation can actually do much more for us than just provide more reliable performance estimates than the validation-set approach did. For now though, we shift focus slightly to the question – how do we know which predictors should be included in our model?

Motivation

When we first introduced multiple linear regression, we took an approach in which we built a “large” model – a model that included most/all of our available predictors. Once we had that model, we used a procedure called backward elimination to eliminate terms from our model if they were associated with \(p\)-values above the 5% significance threshold. We could have taken the opposite approach, starting with an empty model, adding predictors/terms one at a time – a process called forward selection.

With both of these methods, we’re allowing our model to be quite greedy. It may be helpful to think of these processes as allowing our model to go “shopping” for predictors – in the backward selection paradigm, the model begins by adding every item in the store to its shopping cart, and then carefully places the ones it doesn’t want back on the shelves – in the forward selection paradigm, the model begins with an empty cart and adds its favorite items one-by-one into its shopping cart. This seems quite reasonable at first, but we’re engaging in quite risky behavior.

  • From a purely statistical standpoint, we’re evaluating lots of \(t\)-tests in determining whether model terms are statistically significant or not. The likelihood of making at least one Type I Error (saying that a model term is statistically significant, when it is not so in the population) becomes very large in these processes.
  • From a model-fit perspective, the more predictor variables a model has, the more flexible it is – the more flexible a model is, the better it will fit the training data and the more likely it is to become overfit.

By allowing our model to “shop” freely for its predictors, we are enticing our model to overfit the training data – giving our model a “budget” to spend on its shopping trip would be a really nice way to lower the likelihood that our model “buys” too many predictors and overfits the training data.

Optimization Procedures for Model Fitting

All of the models we’ve fit so far in our course use Ordinary Least Squares as the underlying fitting procedure – we are moving to new waters now. Ridge Regression and LASSO models belong to a class of models called Generalized Linear Models. Although their fitting procedures under the hood are different, the consistency of the {tidymodels} framework allows us to make this change quite simply.

Regularization

The process of applying an additional constraint, such as a budget, to a model-fitting procedure is one example of a broader strategy commonly referred to as regularization. Regularization techniques are utilized with lots of model classes to help prevent those models from overfitting our training data.

Objectives

After working through this notebook you should be able to:

  • Articulate the backward selection approach and how it relates to the multiple linear regression models we’ve constructed and analyzed so far in our course.
  • Discuss forward selection as an alternative to the backward selection process.
  • Discuss and implement Ridge Regression and the LASSO as alternatives to Ordinary Least Squares within the {tidymodels} framework.
  • Discuss the benefits and drawbacks for forward- and backward-selection, Ridge Regression, and the LASSO relative to one another.

Ordinary Least Squares

The process of fitting a model using ordinary least squares is an optimization problem. Simply put, to fit a model of the form

\[\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\] we minimize the residual sum of squares:

\[\text{Minimize:}~~\sum_{\text{training data}}{\left(y_{\text{observed}} - y_{\text{predicted}}\right)^2}\]

Which is equivalent to

\[\text{Minimize:}~\sum_{\text{training data}}{\left(y_{\text{observed}} - \left(\beta_0 + \sum_{i = 1}^{k}{\beta_i x_i}\right)\right)^2}\]

This process of choosing values for \(\beta_0,~\beta_1,~\beta_2,~\cdots,~\beta_k\) is called Ordinary Least Squares (OLS), and it’s what we’ve been using (or, rather R has been using) all semester long to fit our models.

In OLS, the optimization procedure can freely choose those \(\beta\) values, without any constraint (OLS is shopping without a budget). Ridge Regression and the LASSO introduce extra constraints on this optimization problem.
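
As a quick illustration (a minimal sketch using the built-in mtcars data, which is not part of our course models), lm() carries out this unconstrained minimization for us:

# Fit a simple two-predictor model using OLS
ols_fit <- lm(mpg ~ wt + hp, data = mtcars)

# The chosen coefficient estimates...
coef(ols_fit)

# ...and the residual sum of squares they minimize
sum(residuals(ols_fit)^2)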

Ridge Regression

The process of fitting a model using Ridge Regression is similar to that of using OLS, except that a constraint on the chosen \(\beta\) coefficients is imposed. That is, using Ridge Regression to fit a model of the form \[\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\] Solves the following optimization problem:

\[\begin{array}{ll}\text{Minimize:} & \displaystyle{\sum_{\text{training data}}{\left(y_{\text{observed}} - \left(\beta_0 + \sum_{i = 1}^{k}{\beta_i x_i}\right)\right)^2}}\\ \text{Subject To:} & \displaystyle{\sum_{i=1}^{k}{\beta_i^2} \leq C}\end{array}\]

Where \(C\) can be thought of as a total coefficient budget. That is, it becomes very expensive for the model to assign large coefficients to its predictors.
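
This constrained problem has an equivalent penalized form, in which a penalty parameter \(\lambda\) plays the role of the budget (a larger \(\lambda\) corresponds to a smaller budget \(C\)). Up to some scaling details, this is the quantity that the glmnet engine we'll use later actually minimizes, and \(\lambda\) corresponds to the penalty argument we'll see below:

\[\text{Minimize:}~~\sum_{\text{training data}}{\left(y_{\text{observed}} - \left(\beta_0 + \sum_{i = 1}^{k}{\beta_i x_i}\right)\right)^2} + \lambda\sum_{i=1}^{k}{\beta_i^2}\]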

Least Absolute Shrinkage and Selection Operator (LASSO)

The difference between Ridge Regression and the LASSO is in how “spent budget” is calculated. Instead of summing the squares of the \(\beta\) coefficients, we’ll sum their absolute values. That is, using LASSO to fit a model of the form \[\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\] Solves the following optimization problem: \[\begin{array}{ll}\text{Minimize:} & \displaystyle{\sum_{\text{training data}}{\left(y_{\text{observed}} - \left(\beta_0 + \sum_{i = 1}^{k}{\beta_i x_i}\right)\right)^2}}\\ \text{Subject To:} & \displaystyle{\sum_{i=1}^{k}{\left|\beta_i\right|} \leq C}\end{array}\]

Where \(C\) can again be thought of as a total coefficient budget. Because the LASSO sums the absolute values of the \(\beta\) coefficients rather than their squares when calculating the “budget used”, and due to the mathematics behind the fitting algorithm, models fit using the LASSO have their small coefficients sent to exactly \(0\). That is, the LASSO can be used as a variable selection procedure. Models fit with Ridge Regression, on the other hand, often leave several predictors with very small (but non-zero) coefficients – while those predictors don’t have much influence on the overall model predictions, they do still remain in the model.
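
The LASSO also has an analogous penalized form; the only change from Ridge Regression is that the absolute values of the coefficients, rather than their squares, are penalized:

\[\text{Minimize:}~~\sum_{\text{training data}}{\left(y_{\text{observed}} - \left(\beta_0 + \sum_{i = 1}^{k}{\beta_i x_i}\right)\right)^2} + \lambda\sum_{i=1}^{k}{\left|\beta_i\right|}\]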

Some Feature PreProcessing Concerns

If we are going to utilize Ridge Regression or the LASSO, we need to ensure that all of our predictors are on a common scale. Otherwise, predictors/features whose observed values are large are artificially cheaper for the model to use (because we can attach small coefficients to them) than features whose observed values are already small. Similarly, features whose observed values are very small are artificially made more expensive for the model to use because they demand larger coefficients in order to have influence over model predictions. There are two very common scaling methods, both illustrated in a short sketch after this list:

  • Min/Max scaling projects a variable onto the interval \(\left[0, 1\right]\), where the minimum observed value is sent to \(0\) and the maximum observed value is sent to \(1\).

    • Min/Max scaling for the variable \(x\) is done using the formula: \(\displaystyle{\frac{x - \min\left(x\right)}{\max\left(x\right) - \min\left(x\right)}}\).
    • We can add step_range() to a recipe() to min/max scale a feature.
  • Standardization converts a variable’s raw measurements into standard deviations (\(z\)-scores), where the mean of the transformed variable is \(0\) and the standard deviation of the transformed variable is \(1\).

    • Standardized scaling for the variable \(x\) is done using the formula: \(\displaystyle{\frac{x - \text{mean}\left(x\right)}{\text{sd}(x)}}\)
    • We can add step_normalize() to a recipe() to standardize the values of a feature.
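
As a quick illustration, here is a minimal sketch of both scaling methods applied to a small, made-up vector (the values are arbitrary):

# Min/max scaling and standardization of a small, made-up vector
x <- c(2, 5, 9, 14, 20)

# Min/max scaling (what step_range() does by default): minimum maps to 0, maximum maps to 1
(x - min(x)) / (max(x) - min(x))

# Standardization (what step_normalize() does): mean 0, standard deviation 1
(x - mean(x)) / sd(x)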

Fitting a Model Using Ridge Regression or the LASSO

The tidymodels framework provides us with a standardized structure for defining and fitting models. Everything we’ve learned about the syntax and workflow for fitting a linear regression model using OLS can be used to fit these Generalized Linear Models. We’ll just need to set_engine() to something other than "lm" – specifically, an engine which can fit models using the constraints defined in the Ridge Regression and LASSO sections above. We’ll use "glmnet", which requires two additional arguments mixture and penalty.

  • The mixture argument is a number between \(0\) and \(1\). Setting mixture = 1 results in a LASSO model, while setting mixture = 0 results in a Ridge Regression model. Values in between \(0\) and \(1\) result in a mixed LASSO/Ridge approach, where the penalty is partially determined by the Ridge Regression constraint and partially by the LASSO constraint.
  • The penalty argument corresponds to the amount of regularization being applied – that is, the penalty is related to the budget parameter we described earlier. Both arguments are illustrated in the short sketch after this list.
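
To make these arguments concrete, here is a short sketch of what the three flavors of model specification might look like (the penalty value of 0.1 is arbitrary, and the {glmnet} package must be installed to actually fit these):

library(tidymodels)

# Ridge Regression: mixture = 0
ridge_spec <- linear_reg(mixture = 0, penalty = 0.1) %>%
  set_engine("glmnet")

# LASSO: mixture = 1
lasso_spec <- linear_reg(mixture = 1, penalty = 0.1) %>%
  set_engine("glmnet")

# An even blend of the two constraints (sometimes called an elastic net)
blended_spec <- linear_reg(mixture = 0.5, penalty = 0.1) %>%
  set_engine("glmnet")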

Ummm…Great – So What Do I Use and When Do I Use It?

There is some really good news coming – just bear with me!

The choice between Ridge Regression and LASSO depends on your goals. If you are looking for a tool to help with variable selection, then the LASSO is your choice, since it will result in less valuable predictors being assigned coefficients of exactly \(0\). If you are building a predictive model, you might try both and see which one performs better for your specific use-case.

How do I choose a penalty? As I understand it, there’s no great science to choosing a penalty. People will typically try values and see what the best ones are for their particular use-case.

So basically I’ve told you about these two new classes of model that help prevent overfitting. They each require a penalty parameter, and I’ve given you no real guidance on how to choose it. For now, we’ll simply choose a penalty parameter and mention that we could build several models with different penalty values to try and find one that performs best. I’ll give you a much better method when we talk about hyperparameter tuning.
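
To make the “try several penalty values” idea concrete, here is a minimal sketch that compares cross-validated performance across a handful of candidate penalties. It uses the built-in mtcars data (and arbitrary penalty values) so that it stands on its own; the same pattern works with the Ames workflow we build below:

library(tidymodels)

set.seed(456)
mtcars_folds <- vfold_cv(mtcars, v = 5)

mtcars_rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors())

# Cross-validate a LASSO model at a single penalty value
cv_results_for_penalty <- function(pen) {
  spec <- linear_reg(mixture = 1, penalty = pen) %>%
    set_engine("glmnet")

  workflow() %>%
    add_model(spec) %>%
    add_recipe(mtcars_rec) %>%
    fit_resamples(mtcars_folds) %>%
    collect_metrics() %>%
    mutate(penalty = pen)
}

# Compare a few candidate penalties and look for the lowest rmse
map_dfr(c(0.01, 0.1, 1, 10), cv_results_for_penalty)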

Implementing Ridge Regression and the LASSO

There are a few additional concerns with fitting Ridge Regression and LASSO models that we haven’t needed to deal with before now.

  • The glmnet engine requires that no missing values are included in any fold.

    • We can omit rows with missing data. (drawback: we are throwing away observations and we may be drastically reducing the size of our available data)
    • We can omit features with missing data. (drawback: we are sacrificing potentially valuable predictors by omitting them from our model)
    • We can impute missing values using a step_impute_*() feature engineering step in our recipe. (drawback: we are making our best guess at what the missing value should be, but we are introducing additional uncertainty to our model)
  • Because the penalty parameter imposes a budget on coefficients, all of our predictors must be on the same scale – otherwise, some predictors are artificially cheaper or more expensive than others to include in our model.

    • We can use step_range() or step_normalize() on all_numerical_predictors() to achieve this.

We’ll filter out the observations with an unknown response (Sale_Price), and we’ll use step_impute_knn() to apply a \(k\)-nearest neighbors imputation scheme to any missing values in our predictor columns. Finally, we’ll use step_normalize() to center and scale our numerical predictors. Then we’ll proceed as usual.

Implementing Ridge Regression

We’ll build and assess a Ridge Regression model first.

library(tidymodels)    # modeling framework (assumed to already be loaded from earlier notebooks)
library(knitr)         # kable(), for printing tables
library(kableExtra)    # kable_styling(), for styling those tables

set.seed(123)

# The ames data frame is assumed to be available from earlier in the course.
# Keep only the rows where the response (Sale_Price) is known.
ames_known_price <- ames %>%
  filter(!(is.na(Sale_Price)))

ames_split <- initial_split(ames_known_price, prop = 0.9)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)

ames_folds <- vfold_cv(ames_train, v = 5)

# Ridge Regression: mixture = 0, with a (somewhat arbitrary) penalty of 1e4
ridge_reg_spec <- linear_reg(mixture = 0, penalty = 1e4) %>%
  set_engine("glmnet")

reg_rec <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_normalize(all_numeric_predictors()) %>%  # common scale before KNN imputation
  step_impute_knn(all_predictors()) %>%         # impute missing values via k-nearest neighbors
  step_normalize(all_numeric_predictors()) %>%  # re-center/re-scale after imputation
  step_other(all_nominal_predictors()) %>%      # collapse rare categorical levels into "other"
  step_dummy(all_nominal_predictors())          # dummy-encode categorical predictors

ridge_reg_wf <- workflow() %>%
 add_model(ridge_reg_spec) %>%
 add_recipe(reg_rec)

ridge_reg_cv <- ridge_reg_wf %>%
  fit_resamples(ames_folds)

ridge_reg_cv %>%
  collect_metrics() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
.metric .estimator mean n std_err .config
rmse standard 31500.135966 5 3295.091381 Preprocessor1_Model1
rsq standard 0.848782 5 0.020622 Preprocessor1_Model1
ridge_reg_cv %>%
  collect_metrics(summarize = FALSE) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
id .metric .estimator .estimate .config
Fold1 rmse standard 40380.8832905 Preprocessor1_Model1
Fold1 rsq standard 0.8113437 Preprocessor1_Model1
Fold2 rmse standard 32296.5525647 Preprocessor1_Model1
Fold2 rsq standard 0.8502330 Preprocessor1_Model1
Fold3 rmse standard 23650.9651714 Preprocessor1_Model1
Fold3 rsq standard 0.8972481 Preprocessor1_Model1
Fold4 rmse standard 36684.1034145 Preprocessor1_Model1
Fold4 rsq standard 0.7941962 Preprocessor1_Model1
Fold5 rmse standard 24488.1753889 Preprocessor1_Model1
Fold5 rsq standard 0.8908891 Preprocessor1_Model1
ridge_reg_fit <- ridge_reg_wf %>%
  fit(ames_train)

ridge_reg_fit %>%
  tidy() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
term estimate penalty
(Intercept) 188791.70984 10000
Lot_Frontage 796.31887 10000
Lot_Area 2636.36801 10000
Year_Built 3063.69378 10000
Year_Remod_Add 4628.39736 10000
Mas_Vnr_Area 4480.80370 10000
BsmtFin_SF_1 -539.71337 10000
BsmtFin_SF_2 791.80008 10000
Bsmt_Unf_SF -1879.28956 10000
Total_Bsmt_SF 5911.99989 10000
First_Flr_SF 7305.49061 10000
Second_Flr_SF 8181.67116 10000
Low_Qual_Fin_SF -524.69494 10000
Gr_Liv_Area 12300.92624 10000
Bsmt_Full_Bath 2874.76949 10000
Bsmt_Half_Bath -658.83986 10000
Full_Bath 4221.38633 10000
Half_Bath 2233.83038 10000
Bedroom_AbvGr -889.11549 10000
Kitchen_AbvGr -2422.88820 10000
TotRms_AbvGrd 3417.43895 10000
Fireplaces 4536.95443 10000
Garage_Cars 5041.60150 10000
Garage_Area 2824.62363 10000
Wood_Deck_SF 1555.87237 10000
Open_Porch_SF -284.75723 10000
Enclosed_Porch 1226.81360 10000
Three_season_porch 486.32894 10000
Screen_Porch 3534.36652 10000
Pool_Area -1040.13291 10000
Misc_Val -4752.01219 10000
Mo_Sold -405.39528 10000
Year_Sold -1199.00037 10000
Longitude 769.72448 10000
Latitude 4390.84512 10000
MS_SubClass_One_Story_1945_and_Older -4832.96721 10000
MS_SubClass_One_Story_1946_and_Newer_All_Styles 3202.29338 10000
MS_SubClass_One_Story_PUD_1946_and_Newer -3648.19533 10000
MS_SubClass_Two_Story_1946_and_Newer 665.56838 10000
MS_SubClass_other 287.65821 10000
MS_Zoning_Residential_Medium_Density -5645.17735 10000
MS_Zoning_other -6902.63199 10000
Street_other -23356.12546 10000
Alley_other -110.60832 10000
Lot_Shape_Slightly_Irregular 2411.21852 10000
Lot_Shape_other 1953.92542 10000
Land_Contour_other 136.73405 10000
Utilities_other -15838.48097 10000
Lot_Config_CulDSac 10227.06658 10000
Lot_Config_Inside 238.57320 10000
Lot_Config_other -5644.27896 10000
Land_Slope_other 4960.68588 10000
Neighborhood_Edwards -8912.65959 10000
Neighborhood_Gilbert -15851.15782 10000
Neighborhood_North_Ames -4943.77431 10000
Neighborhood_Northridge_Heights 18963.37114 10000
Neighborhood_Old_Town -6176.63079 10000
Neighborhood_Somerset 8825.83248 10000
Neighborhood_other 4868.82978 10000
Condition_1_Norm 9690.18424 10000
Condition_1_other 1658.63595 10000
Condition_2_other 121.73332 10000
Bldg_Type_TwnhsE -10559.04514 10000
Bldg_Type_other -11382.12086 10000
House_Style_One_Story 2307.09727 10000
House_Style_Two_Story -2111.32137 10000
House_Style_other -877.41700 10000
Overall_Qual_Average -1984.73829 10000
Overall_Qual_Below_Average -3299.97991 10000
Overall_Qual_Good 1126.18427 10000
Overall_Qual_Very_Good 14813.63526 10000
Overall_Qual_other 29424.63035 10000
Overall_Cond_Average -2036.76744 10000
Overall_Cond_Good 4967.76736 10000
Overall_Cond_other -3677.52163 10000
Roof_Style_Hip 5321.81005 10000
Roof_Style_other -4334.05469 10000
Roof_Matl_other 2536.26513 10000
Exterior_1st_MetalSd 1789.15098 10000
Exterior_1st_Plywood 245.73682 10000
Exterior_1st_VinylSd 763.44060 10000
Exterior_1st_Wd.Sdng 39.68877 10000
Exterior_1st_other 6045.78572 10000
Exterior_2nd_MetalSd 1876.79559 10000
Exterior_2nd_Plywood -3296.46398 10000
Exterior_2nd_VinylSd 924.71122 10000
Exterior_2nd_Wd.Sdng 3394.08267 10000
Exterior_2nd_other -407.40383 10000
Mas_Vnr_Type_None 5526.82124 10000
Mas_Vnr_Type_Stone 3042.59449 10000
Mas_Vnr_Type_other -9062.56602 10000
Exter_Qual_Typical -6768.82107 10000
Exter_Qual_other 16684.08382 10000
Exter_Cond_Typical -742.10084 10000
Exter_Cond_other -14408.92915 10000
Foundation_CBlock -2418.14587 10000
Foundation_PConc 4372.23813 10000
Foundation_other 269.13186 10000
Bsmt_Qual_Good -12590.03081 10000
Bsmt_Qual_Typical -11360.01363 10000
Bsmt_Qual_other -11250.01642 10000
Bsmt_Cond_other -2698.01896 10000
Bsmt_Exposure_Gd 14083.73631 10000
Bsmt_Exposure_Mn -4822.50510 10000
Bsmt_Exposure_No -8071.95550 10000
Bsmt_Exposure_other -5347.16020 10000
BsmtFin_Type_1_BLQ -804.11423 10000
BsmtFin_Type_1_GLQ 7399.35027 10000
BsmtFin_Type_1_LwQ -4119.78777 10000
BsmtFin_Type_1_Rec -790.39767 10000
BsmtFin_Type_1_Unf -2218.54216 10000
BsmtFin_Type_1_other -3351.91538 10000
BsmtFin_Type_2_other -1889.50312 10000
Heating_other -1479.67159 10000
Heating_QC_Good -2440.28159 10000
Heating_QC_Typical -5255.01167 10000
Heating_QC_other -10308.11682 10000
Central_Air_Y 4646.75279 10000
Electrical_SBrkr 318.95670 10000
Electrical_other -146.28148 10000
Kitchen_Qual_Good -10595.62477 10000
Kitchen_Qual_Typical -13183.23141 10000
Kitchen_Qual_other -15338.12342 10000
Functional_other -15019.47361 10000
Fireplace_Qu_No_Fireplace -947.63813 10000
Fireplace_Qu_Typical -2047.88316 10000
Fireplace_Qu_other 544.63828 10000
Garage_Type_BuiltIn 592.47646 10000
Garage_Type_Detchd -1803.54659 10000
Garage_Type_No_Garage 212.40839 10000
Garage_Type_other -11813.03977 10000
Garage_Finish_No_Garage 378.46275 10000
Garage_Finish_RFn -4021.15008 10000
Garage_Finish_Unf -1988.47596 10000
Garage_Qual_Typical -1342.56651 10000
Garage_Qual_other 2158.84972 10000
Garage_Cond_Typical 2395.30760 10000
Garage_Cond_other -6099.84155 10000
Paved_Drive_Paved 4144.04509 10000
Paved_Drive_other 5128.66923 10000
Pool_QC_other 8473.62349 10000
Fence_No_Fence -1923.52356 10000
Fence_other -832.59507 10000
Misc_Feature_other 5427.83293 10000
Sale_Type_WD -4415.50041 10000
Sale_Type_other -6892.17869 10000
Sale_Condition_Normal 6815.41569 10000
Sale_Condition_Partial 11236.53847 10000
Sale_Condition_other 5741.78860 10000

From the regression output above, we can see that some predictors receive large coefficients, while others receive coefficients on a much smaller scale.
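
If we wanted a quicker read on which predictors the ridge model leans on most heavily, one option (a sketch, not run above) is to plot the coefficients that are largest in magnitude:

# Plot the 15 ridge coefficients that are largest in absolute value
ridge_reg_fit %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  slice_max(abs(estimate), n = 15) %>%
  ggplot() +
  geom_col(aes(x = estimate, y = reorder(term, abs(estimate)))) +
  labs(x = "Coefficient Estimate", y = "")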

Implementing the LASSO

The work required to construct a LASSO model is nearly identical. We simply use mixture = 1 in the model specification to signal that we want to use the LASSO constraint instead of the ridge constraint.

lasso_reg_spec <- linear_reg(mixture = 1, penalty = 1e4) %>%
  set_engine("glmnet")

lasso_reg_wf <- workflow() %>%
 add_model(lasso_reg_spec) %>%
 add_recipe(reg_rec)

lasso_reg_cv <- lasso_reg_wf %>%
  fit_resamples(ames_folds)

lasso_reg_cv %>%
  collect_metrics() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
.metric .estimator mean n std_err .config
rmse standard 39400.740143 5 3246.8426135 Preprocessor1_Model1
rsq standard 0.797465 5 0.0229212 Preprocessor1_Model1
lasso_reg_cv %>%
  collect_metrics(summarize = FALSE) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
id .metric .estimator .estimate .config
Fold1 rmse standard 48609.2138434 Preprocessor1_Model1
Fold1 rsq standard 0.7762703 Preprocessor1_Model1
Fold2 rmse standard 42312.8702790 Preprocessor1_Model1
Fold2 rsq standard 0.7742343 Preprocessor1_Model1
Fold3 rmse standard 31940.0083101 Preprocessor1_Model1
Fold3 rsq standard 0.8491859 Preprocessor1_Model1
Fold4 rmse standard 42122.7110247 Preprocessor1_Model1
Fold4 rsq standard 0.7354344 Preprocessor1_Model1
Fold5 rmse standard 32018.8972595 Preprocessor1_Model1
Fold5 rsq standard 0.8522000 Preprocessor1_Model1
lasso_reg_fit <- lasso_reg_wf %>%
  fit(ames_train)

lasso_reg_fit %>%
  tidy() %>%
  filter(estimate != 0) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
term estimate penalty
(Intercept) 184321.933009 10000
Year_Built 7211.041919 10000
Year_Remod_Add 4892.535342 10000
Mas_Vnr_Area 2528.604376 10000
Total_Bsmt_SF 11760.066322 10000
Gr_Liv_Area 25031.362051 10000
Fireplaces 3468.087043 10000
Garage_Cars 7144.990272 10000
Garage_Area 3629.553856 10000
Neighborhood_Northridge_Heights 15138.287188 10000
Overall_Qual_Very_Good 1102.160399 10000
Overall_Qual_other 20288.292906 10000
Exter_Qual_Typical -10032.425048 10000
Bsmt_Exposure_Gd 8094.334611 10000
BsmtFin_Type_1_GLQ 5624.878842 10000
Kitchen_Qual_Typical -4541.882839 10000
Fireplace_Qu_No_Fireplace -4.359246 10000

We can see in the LASSO model that many of the predictors were assigned coefficients of \(0\). The LASSO procedure identified that those predictors weren’t worth “buying”, so it left them out and only included the most useful predictors!
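
One quick way to quantify the contrast with Ridge Regression (a small sketch using the fitted objects above) is to count how many non-intercept coefficients each model kept away from zero:

# Count the non-zero, non-intercept coefficients in each fitted model
ridge_reg_fit %>%
  tidy() %>%
  filter(term != "(Intercept)", estimate != 0) %>%
  nrow()

lasso_reg_fit %>%
  tidy() %>%
  filter(term != "(Intercept)", estimate != 0) %>%
  nrow()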

Summary

In this notebook, we introduced two new regression models: Ridge Regression and the LASSO. These models are of the familiar form \(\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\), but are fit using a constrained optimization procedure rather than ordinary least squares. When we implement Ridge and the LASSO, we are reducing the likelihood that the resulting model is overfit. With these models, however, we must remember that all numerical predictors need to be scaled and no missing data can be present in the training data.