October 4, 2024
What is Multiple Linear Regression?
What are we assuming?
Tests for Model Utility
Further model assessment
Model interpretation
Predictions
If all this seems nearly identical to our discussion on simple linear regression models, well…that’s because it is.
Why choose just one predictor when we could have them all?
\[\begin{align} \mathbb{E}\left[\text{body mass}\right] = \beta_0 + &\beta_1\cdot\left(\text{bill length}\right) + \beta_2\cdot\left(\text{bill depth}\right) +\\ & \beta_3\cdot\left(\text{flipper length}\right) + \beta_4\cdot\left(\text{year}\right)\end{align}\]
Question 1 (Predictive): Can we use penguin bill length, bill depth, flipper length, and observation year to predict body mass?
Question 2 (Inferential): Controlling for bill length, bill depth, and observation year, what is the association between flipper length and penguin body mass?
Question 3 (Inferential): Controlling for flipper length, bill depth, and observation year, what is the association between bill length and penguin body mass?
Question 4 (Inferential): Controlling for flipper length, bill length, and observation year, what is the association between bill depth and penguin body mass?
Question 5 (Inferential): Controlling for flipper length, bill length, and bill depth, what is the association between observation year and penguin body mass?
\(\bigstar\) Let’s try it! \(\bigstar\)
\[\begin{align} \mathbb{E}\left[\text{body mass}\right] = \beta_0 + &\beta_1\cdot\left(\text{bill length}\right) + \beta_2\cdot\left(\text{bill depth}\right) +\\ & \beta_3\cdot\left(\text{flipper length}\right) + \beta_4\cdot\left(\text{year}\right)\end{align}\]
Pre-Modeling Assumptions: Penguin body mass is associated with each of the four predictors in a linear manner, the predictors are not correlated with one another, and body mass is independent of all other possible features.
Post-Modeling Assumptions: The following assumptions are made about model errors (residuals), to ensure that using and interpreting the model is appropriate.
{tidymodels}
{tidymodels}
\(\bigstar\) Let’s try it! \(\bigstar\)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 232658.774875 | 60591.365347 | 3.8398008 | 0.0001560 |
flipper_length_mm | 52.753150 | 2.783871 | 18.9495697 | 0.0000000 |
bill_length_mm | 1.530334 | 5.984567 | 0.2557134 | 0.7983815 |
bill_depth_mm | 28.587515 | 15.997312 | 1.7870199 | 0.0751409 |
year | -119.323909 | 30.224297 | -3.9479466 | 0.0001024 |
\[\begin{align} \mathbb{E}\left[\text{body mass}\right] = 232658.77~ + &~52.75\cdot\left(\text{flipper length}\right) + 1.52\cdot\left(\text{bill length}\right) + \\ &~28.59\cdot\left(\text{bill depth}\right) - 119.32\cdot\left(\text{year}\right) \end{align}\]
\(\bigstar\) Let’s try it! \(\bigstar\)
Does our model include any useful information in predicting / explaining penguin body mass?
\[\begin{array}{lcl} H_0 & : & \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0\\ H_a & : & \text{At least one } \beta_i \text{ is non-zero}\end{array}\]
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.7808404 | 0.7773478 | 387.1782 | 223.571 | 0 | 4 | -1886.198 | 3784.396 | 3805.667 | 37626655 | 251 | 256 |
The \(p\)-value on this test is so small that it rounds to 0.
We’ll reject the null hypothesis and accept that our model contains at least one useful predictor of penguin body mass!
\(\bigstar\) Let’s try it! \(\bigstar\)
We’ll run tests \(\begin{array}{lcl} H_0 & : & \beta_i = 0\\ H_a & : & \beta_i \neq 0\end{array}\) to determine whether each individual term has utility.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 232658.774875 | 60591.365347 | 3.8398008 | 0.0001560 |
flipper_length_mm | 52.753150 | 2.783871 | 18.9495697 | 0.0000000 |
bill_length_mm | 1.530334 | 5.984567 | 0.2557134 | 0.7983815 |
bill_depth_mm | 28.587515 | 15.997312 | 1.7870199 | 0.0751409 |
year | -119.323909 | 30.224297 | -3.9479466 | 0.0001024 |
At the \(\alpha = 0.05\) level of significance, neither bill_length_mm
or bill_depth_mm
are significant.
Removing one of the model terms will update all of the values in the model output – including \(p\)-values.
We’ll remove bill_length_mm
since it has the largest \(p\)-value and revisit the resulting table.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 233601.98715 | 60366.73367 | 3.869714 | 0.0001388 |
flipper_length_mm | 53.21694 | 2.10806 | 25.244509 | 0.0000000 |
bill_depth_mm | 29.45370 | 15.60556 | 1.887385 | 0.0602578 |
year | -119.81399 | 30.10749 | -3.979541 | 0.0000903 |
The bill_depth_mm
term is still not significant. Let’s remove it as well.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 227916.26652 | 60596.048460 | 3.76124 | 0.0002101 |
flipper_length_mm | 50.87429 | 1.712514 | 29.70736 | 0.0000000 |
year | -116.49701 | 30.207960 | -3.85650 | 0.0001460 |
Now, all remaining terms are significant.
\[\mathbb{E}\left[\text{body mass}\right] = 227916.27 + 50.87\cdot\left(\text{flipper length}\right) - 116.5\cdot\left(\text{year}\right)\]
\(\bigstar\) Let’s try it! \(\bigstar\)
metric | value |
---|---|
r.squared | 0.7776845 |
adj.r.squared | 0.7759270 |
sigma | 388.4115769 |
statistic | 442.5110973 |
p.value | 0.0000000 |
df | 2.0000000 |
logLik | -1888.0281851 |
AIC | 3784.0563702 |
BIC | 3798.2370800 |
deviance | 38168478.9336123 |
df.residual | 253.0000000 |
nobs | 256.0000000 |
\(p\)-value is rounded to \(0\), so our resulting model still has some utility in predicting / explaining penguin body mass.
\(R^2_{\text{adj}} \approx 77.6\%\), so approximately 77.6% of variation in penguin body mass is explained by our model.
Training RMSE (sigma
) is about 388.41, so we expect our model to predict penguin body mass to within \(\pm 2\cdot\left(388.41\right) \approx \pm 776.82\) grams.
glance()
to obtain the global performance metrics for your model\[\mathbb{E}\left[\text{body mass}\right] = 227916.27 + 50.87\cdot\left(\text{flipper length}\right) - 116.5\cdot\left(\text{year}\right)\]
mlr_fit %>%
augment(penguins_train) %>%
ggplot() +
geom_point(aes(x = flipper_length_mm,
y = body_mass_g),
color = "black",
alpha = 0.5,
size = 2) +
geom_point(aes(x = flipper_length_mm,
y = .pred),
color = "purple",
alpha = 0.9,
size = 2) +
labs(
title = "Penguin Body Mass by Flipper Length",
subtitle = "Observed Values (black), Model Predictions (purple)",
x = "Flipper Length (mm)",
y = "Body Mass (g)"
)
\(\bigstar\) Let’s try it! \(\bigstar\)
.metric | .estimate |
---|---|
rmse | 376.6980462 |
rsq | 0.7450228 |
\(R^2 \approx 74.5\%\), so approximately 74.5% of variation in penguin body mass is explained by our model, when taking away the training data advantage.
Test RMSE is about 376.7, so we expect our model to predict penguin body mass to within \(\pm 2\cdot\left(376.7\right) \approx \pm 753.4\) grams.
As with our simple linear regression models, the test \(R^2\) value is slightly worse than the corresponding training metric, while the test RMSE is an improvement over the training RMSE.
\(\bigstar\) Let’s try it! \(\bigstar\)
Again, some slight right skew in the distribution of residuals and some associations between the residuals and the predictors, predictions, and response.
\(\bigstar\) Let’s try it! \(\bigstar\)
Calculate residuals for your model
augment()
does this for you by default now!Use graphical techniques to analyze the residuals for your model
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 227916.26652 | 60596.048460 | 3.76124 | 0.0002101 |
flipper_length_mm | 50.87429 | 1.712514 | 29.70736 | 0.0000000 |
year | -116.49701 | 30.207960 | -3.85650 | 0.0001460 |
\[\mathbb{E}\left[\text{body mass}\right] \approx 227916.27 + 50.87\cdot\left(\text{flipper length}\right) - 116.5\cdot\left(\text{year}\right)\]
Interpretations:
(Intercept) We expect a penguin from the year 0 and whose flipper length measures 0mm to have a mass of about 227916.27g
(Flipper Length) Holding year constant, we expect a 1mm increase in flipper length to be associated with about a 50.87g increase in penguin body mass, on average.
(Year) Holding flipper length constant, we expect an elapsed year to be associated with about a 116.5g decrease in penguin body mass, on average.
\(\bigstar\) Let’s try it! \(\bigstar\)
Extract the estimated model fit for your multiple linear regressor
Write down the equation for the estimated model
Provide interpretations of each of your estimated coefficients
Consider the following questions:
As with last time, the answer to the first question requires a prediction interval, while answering the second uses a confidence interval.
The prediction interval will be wider due to the additional uncertainty in predicting the body mass of a single penguin.
Our model predicts
\[\begin{align} \mathbb{E}\left[\text{body mass}\right] &\approx 227916.27 + 50.87\left(212\right) - 116.5\cdot\left(2009\right)\\ &= 4652.21\text{g} \end{align}\]
which we again have 0% confidence in.
.pred |
---|
4659.126 |
\(\bigstar\) Let’s try it! \(\bigstar\)
Identify the characteristics of a property whose price you are interested in predicting
Use tibble()
to create a new data frame containing the characteristics of that property whose rental price you wanted to predict
Use your model to predict the rental price of that property
What is the body mass of a penguin observed in 2009, whose flipper length is 212mm?
.pred_lower | .pred_upper |
---|---|
3890.01 | 5428.241 |
We are 95% confident that a single randomly selected penguin from 2009 with a flipper length of 212mm will have a body mass between about 3890g and 5428g.
\(\bigstar\) Let’s try it! \(\bigstar\)
Update the prediction you just made to obtain a prediction interval rather than a point estimate for the estimated rental price
Interpret the resulting interval
What is the average body mass of all penguins observed in 2009 and whose flipper lengths were 212mm?
We are 95% confident that the average body mass of all penguins from 2009 with a flipper length of 212mm is between about 4579g and 4739g.
\(\bigstar\) Let’s try it! \(\bigstar\)
Make another update to obtain a confidence interval for the estimated rental price
Interpret the resulting interval and compare it to the previous interval you constructed
What is the body mass of a penguin observed in 2009 with flipper length 212mm?
What is the average body mass of all penguins observed in 2009 with flipper lengths 212mm?
Residual Analysis and Model Quality