Variable Selection Methods: Regularization with Ridge and LASSO
Published
August 17, 2024
Recap
With Cross-Validation, we’ve supercharged our modeling powers. We’ve made our model performance estimates much more reliable, which gives us greater confidence in understanding the predictive power (or lack thereof) of our models. We’ve become much more responsible modelers, in this sense. We’ll see again soon that cross-validation can actually do much more for us than just provide more reliable performance estimates than the validation-set approach did. For now though, we shift focus slightly to the question: how do we know which predictors should be included in our model?
Motivation
When we first introduced multiple linear regression, we took an approach in which we built a “large” model – a model that included most/all of our available predictors. Once we had that model, we used a procedure called backward elimination to eliminate terms from our model if they were associated with \(p\)-values above the 5% significance threshold. We could have taken the opposite approach, starting with an empty model, adding predictors/terms one at a time – a process called forward selection.
With both of these methods, we’re allowing our model to be quite greedy. It may be helpful to think of these processes as allowing our model to go “shopping” for predictors – in the backward selection paradigm, the model begins by adding every item in the store to its shopping cart, and then carefully places the ones it doesn’t want back on the shelves – in the forward selection paradigm, the model begins with an empty cart and adds its favorite items one-by-one into its shopping cart. This seems quite reasonable at first, but we’re engaging in quite risky behavior.
From a purely statistical standpoint, we’re evaluating lots of \(t\)-tests in determining whether model terms are statistically significant or not. The likelihood of making at least one Type I Error (saying that a model term is statistically significant, when it is not so in the population) becomes very large in these processes.
From a model-fit perspective, the more predictor variables a model has, the more flexible it is – the more flexible a model is, the better it will fit the training data and the more likely it is to become overfit.
By allowing our model to “shop” freely for its predictors, we are enticing our model to overfit the training data – giving our model a “budget” to spend on its shopping trip would be a really nice way to lower the likelihood that our model “buys” too many predictors and overfits the training data.
Optimization Procedures for Model Fitting
All of the models we’ve fit so far in our course use Ordinary Least Squares as the underlying fitting procedure – we are moving to new waters now. Ridge Regression and LASSO models belong to a class of models called Generalized Linear Models. Although their fitting procedures under the hood are different, the consistency of the {tidymodels} framework allows us to make this change quite simply.
Regularization
The process of applying an additional constraint, such as a budget, is one example of a class of techniques commonly referred to as regularization. Regularization techniques are used with many model classes to help prevent those models from overfitting our training data.
Objectives
After working through this notebook you should be able to:
Articulate the backward selection approach and how it relates to the multiple linear regression models we’ve constructed and analyzed so far in our course.
Discuss forward selection as an alternative to the backward selection process.
Discuss and implement Ridge Regression and the LASSO as alternatives to Ordinary Least Squares within the {tidymodels} framework.
Discuss the benefits and drawbacks for forward- and backward-selection, Ridge Regression, and the LASSO relative to one another.
Ordinary Least Squares
The process of fitting a model using ordinary least squares is an optimization problem. Simply put, to fit a model of the form
\[\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\] we minimize the residual sum of squares over the training data:

\[\displaystyle{\sum_{\text{training data}}{\left(y_{\text{observed}} - \left(\beta_0 + \sum_{i = 1}^{k}{\beta_i x_i}\right)\right)^2}}\]
This process of choosing values for \(\beta_0,~\beta_1,~\beta_2,~\cdots,~\beta_k\) is called Ordinary Least Squares (OLS), and it’s what we’ve been using (or, rather R has been using) all semester long to fit our models.
In OLS, the optimization procedure can freely choose those \(\beta\) values, without any constraint (OLS is shopping without a budget). Ridge Regression and the LASSO introduce extra constraints on this optimization problem.
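As a quick point of reference, here is a minimal sketch of the OLS fitting we’ve been doing all along, written with {tidymodels}. The data frame my_data and the predictors x1 and x2 are placeholders, not objects from our actual analysis.

```r
library(tidymodels)

# Ordinary least squares via the "lm" engine (the default for linear_reg())
ols_spec <- linear_reg() %>%
  set_engine("lm")

# `my_data` is a hypothetical data frame with response y and predictors x1, x2
ols_fit <- ols_spec %>%
  fit(y ~ x1 + x2, data = my_data)

tidy(ols_fit)  # the beta estimates chosen by minimizing the residual sum of squares
```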
Ridge Regression
The process of fitting a model using Ridge Regression is similar to that of using OLS, except that a constraint on the chosen \(\beta\) coefficients is imposed. That is, using Ridge Regression to fit a model of the form \[\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\] solves the following optimization problem: \[\begin{array}{ll}\text{Minimize:} & \displaystyle{\sum_{\text{training data}}{\left(y_{\text{observed}} - \left(\beta_0 + \sum_{i = 1}^{k}{\beta_i x_i}\right)\right)^2}}\\
\text{Subject To:} & \displaystyle{\sum_{i=1}^{k}{\beta_i^2} \leq C}\end{array}\]
where \(C\) can be thought of as a total coefficient budget. That is, it becomes very expensive for the model to assign large coefficients to many of its predictors.
Least Absolute Shrinkage and Selection Operator (LASSO)
The difference between Ridge Regression and the LASSO is in how “spent budget” is calculated. Instead of summing the squares of the \(\beta\) coefficients, we’ll sum their absolute values. That is, using the LASSO to fit a model of the form \[\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\] solves the following optimization problem: \[\begin{array}{ll}\text{Minimize:} & \displaystyle{\sum_{\text{training data}}{\left(y_{\text{observed}} - \left(\beta_0 + \sum_{i = 1}^{k}{\beta_i x_i}\right)\right)^2}}\\
\text{Subject To:} & \displaystyle{\sum_{i=1}^{k}{\left|\beta_i\right|} \leq C}\end{array}\]
where \(C\) can again be thought of as a total coefficient budget. Because Ridge Regression squares the \(\beta\) coefficients when calculating the “budget used,” it shrinks coefficients toward \(0\) but rarely sets any of them exactly to \(0\). The LASSO behaves differently: due to the mathematics behind its absolute-value constraint, models fit using the LASSO have their smallest coefficients sent exactly to \(0\). That is, the LASSO can be used as a variable selection procedure. Models fit with Ridge Regression, by contrast, often leave several predictors with very small coefficients; while those predictors don’t have much influence on the overall model predictions, they do still remain in the model.
Some Feature PreProcessing Concerns
If we are going to utilize Ridge Regression or the LASSO, we need to ensure that all of our predictors are on a standardized scale. Otherwise, predictors/features whose observed values are large are superficially cheaper for the model to use (because we can attach small coefficients to them) than features whose observed values are already small. Similarly, features whose observed values are very small are artificially made more expensive for the model to use because they demand larger coefficients in order to have influence over model predictions. There are two very common scaling methods:
Min/Max scaling projects a variable onto the interval \(\left[0, 1\right]\), where the minimum observed value is sent to \(0\) and the maximum observed value is sent to \(1\).
Min/Max scaling for the variable \(x\) is done using the formula: \(\displaystyle{\frac{x - \min\left(x\right)}{\max\left(x\right) - \min\left(x\right)}}\).
We can add step_range() to a recipe() to min/max scale a feature.
Standardization converts a variable’s raw measurements into standard deviations (\(z\)-scores), where the mean of the transformed variable is \(0\) and the standard deviation of the transformed variable is \(1\).
Standardized scaling for the variable \(x\) is done using the formula: \(\displaystyle{\frac{x - \text{mean}\left(x\right)}{\text{sd}(x)}}\)
We can add step_normalize() to a recipe() to standardize the values of a feature.
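As a sketch, both scaling steps might be added to a recipe like the ones below. The data frame my_data and the model formula are placeholders; in practice we would apply only one of the two scalings.

```r
library(tidymodels)

# Min/Max scaling of every numeric predictor onto [0, 1]
minmax_rec <- recipe(y ~ ., data = my_data) %>%
  step_range(all_numeric_predictors(), min = 0, max = 1)

# Standardization of every numeric predictor to mean 0 and standard deviation 1
standardize_rec <- recipe(y ~ ., data = my_data) %>%
  step_normalize(all_numeric_predictors())
```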
Fitting a Model Using Ridge Regression or the LASSO
The {tidymodels} framework provides us with a standardized structure for defining and fitting models. Everything we’ve learned about the syntax and workflow for fitting a linear regression model using OLS can be used to fit these Generalized Linear Models. We’ll just need to set_engine() to something other than "lm". Specifically, we need an engine which can fit models using the constraints defined in the Ridge Regression and LASSO sections above. We’ll use "glmnet", which requires two additional arguments: mixture and penalty.
The mixture argument is a number between \(0\) and \(1\). Setting mixture = 1 results in a LASSO model, while setting mixture = 0 results in a Ridge Regression model. Values in between \(0\) and \(1\) result in a mixed LASSO/Ridge approach, where the penalty is partially determined by the Ridge Regression constraint and partially by the LASSO constraint.
The penalty argument corresponds to the amount of regularization being applied. That is, the penalty is related to the budget parameter we’ve described earlier: a larger penalty corresponds to a smaller coefficient budget, and therefore to more aggressive shrinking of the coefficients.
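For example, a pair of model specifications might look like the sketch below. The penalty value of 0.1 is an arbitrary placeholder, not a recommendation.

```r
library(tidymodels)

# mixture = 0 requests Ridge Regression; mixture = 1 requests the LASSO
ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("glmnet")

lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")
```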
Ummm…Great – So What Do I Use and When Do I Use It?
There is some really good news coming – just bear with me!
The choice between Ridge Regression and LASSO depends on your goals. If you are looking for a tool to help with variable selection, then the LASSO is your choice, since it will result in less valuable predictors being assigned coefficients of exactly \(0\). If you are building a predictive model, you might try both and see which one performs better for your specific use-case.
How do I choose a penalty? As I understand it, there’s no great science to choosing a penalty. People will typically try values and see what the best ones are for their particular use-case.
So basically I’ve told you about these two new classes of model that help prevent overfitting. They each require a penalty parameter, and I’ve given you no real guidance on how to choose it. For now, we’ll simply choose a penalty parameter and note that we could build several models with different penalty values to try and find one that performs best. I’ll give you a much better method when we talk about hyperparameter tuning.
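To make the “try several penalty values” idea concrete, here is a rough sketch that compares a handful of candidate penalties by cross-validated RMSE. The recipe my_rec and the cross-validation folds my_folds are assumed placeholder objects (a recipe() and a vfold_cv() result), and the candidate penalty values are arbitrary.

```r
library(tidymodels)

# Candidate penalty values to compare (arbitrary choices)
penalties <- c(0.01, 0.1, 1, 10)

cv_results <- purrr::map_dfr(penalties, function(pen) {
  spec <- linear_reg(penalty = pen, mixture = 1) %>%
    set_engine("glmnet")

  workflow() %>%
    add_recipe(my_rec) %>%
    add_model(spec) %>%
    fit_resamples(resamples = my_folds) %>%
    collect_metrics() %>%
    filter(.metric == "rmse") %>%
    mutate(penalty = pen)
})

cv_results  # compare cross-validated RMSE across the candidate penalties
```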
Implementing Ridge Regression and the LASSO
There are a few additional concerns with fitting Ridge Regression and LASSO models that we haven’t needed to deal with before now.
The glmnet engine requires that no missing values are included in any fold.
We can omit rows with missing data. (drawback: we are throwing away observations and we may be drastically reducing the size of our available data)
We can omit features with missing data. (drawback: we are sacrificing potentially valuable predictors by omitting them from our model)
We can impute missing values using a step_impute_*() feature engineering step in our recipe. (drawback: we are making our best guess at what the missing value should be, but we are introducing additional uncertainty to our model)
Because the penalty parameter imposes a budget on coefficients, all of our predictors must be on the same scale – otherwise, some predictors are artificially cheaper or more expensive than others to include in our model.
We can use step_range() or step_normalize() on all_numeric_predictors() to achieve this.
We’ll filter out the observations with an unknown response (Sale_Price), use step_impute_knn() to apply a \(k\)-nearest neighbors imputation scheme for any missing values in our predictor columns, and finally use step_normalize() to center and scale our numerical predictors. Then we’ll proceed as usual.
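A sketch of that preprocessing appears below. The data frame names (ames and ames_known) and the recipe name ames_rec are placeholders assumed for illustration.

```r
library(tidymodels)

# Keep only the rows where the response is known
ames_known <- ames %>%
  filter(!is.na(Sale_Price))

ames_rec <- recipe(Sale_Price ~ ., data = ames_known) %>%
  step_impute_knn(all_predictors()) %>%       # k-nearest neighbors imputation for missing predictor values
  step_normalize(all_numeric_predictors())    # center and scale the numeric predictors
```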
Implementing Ridge Regression
We’ll build and assess a Ridge Regression model first.
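A sketch of that process is shown below, reusing the assumed ames_rec recipe and ames_known data frame from the preprocessing sketch above; the penalty value remains an arbitrary placeholder.

```r
# Ridge Regression: mixture = 0
ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("glmnet")

ridge_fit <- workflow() %>%
  add_recipe(ames_rec) %>%
  add_model(ridge_spec) %>%
  fit(data = ames_known)

ridge_fit %>%
  extract_fit_parsnip() %>%
  tidy()                    # inspect the estimated coefficients
```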
From the regression output above, we can see some predictors getting large coefficients, while others receive coefficients on a smaller scale.
Implementing the LASSO
The work required to construct a LASSO model is nearly identical. We simply use mixture = 1 in the model specification to signal that we want to use the LASSO constraint instead of the ridge constraint.
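A corresponding sketch, under the same assumed objects as the Ridge example, would look like the following.

```r
# LASSO: mixture = 1
lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

lasso_fit <- workflow() %>%
  add_recipe(ames_rec) %>%
  add_model(lasso_spec) %>%
  fit(data = ames_known)

lasso_fit %>%
  extract_fit_parsnip() %>%
  tidy()                    # many coefficients should appear as exactly 0
```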
We can see in the LASSO model that many of the predictors were assigned coefficients of \(0\). The LASSO procedure identified that those predictors weren’t worth “buying”, so it left them out and only included the most useful predictors!
Summary
In this notebook, we introduced two new regression models: Ridge Regression and the LASSO. These models are of the familiar form \(\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\), but are fit using a constrained optimization procedure rather than ordinary least squares. When we implement Ridge and the LASSO, we are reducing the likelihood that the resulting model is overfit. With these models, however, we must remember that all numerical predictors need to be scaled and no missing data can be present in the training data.