October 1, 2024
What is Simple Linear Regression?
What are we assuming?
Global tests for model utility and the individual term-based test.
Further model assessment
Model interpretation
Predictions
Question 1 (Inferential): What, if anything, is the relationship between penguin flipper length and body mass?
Question 1 (Predictive): Can we use penguin flipper length to predict body mass?
Question 2 (Inferential): What, if anything, is the relationship between penguin bill depth and body mass?
Question 2 (Predictive): Can we use penguin bill depth to predict body mass?
Each of these questions can be answered by the construction and analysis of a model.
Simple linear regression predicts a response as a linear function of a single predictor variable.
\[\mathbb{E}\left[\text{body mass}\right] = \beta_0 + \beta_1\cdot \left(\text{flipper length}\right)\\ \textbf{or}\\ \mathbb{E}\left[\text{body mass}\right] = \beta_0 + \beta_1\cdot \left(\text{bill depth}\right)\]
\[\mathbb{E}\left[\text{body mass}\right] = -5769 + 49.7\left(\text{flipper length}\right)\\ \textbf{or}\\ \mathbb{E}\left[\text{body mass}\right] = 7697 - 203\left(\text{bill depth}\right)\]
\(\bigstar\) As usual, I recommend that you play along during our discussion! \(\bigstar\)
MAT300 project space
\[\mathbb{E}\left[\text{body mass}\right] = \beta_0 + \beta_1\cdot\left(\text{flipper length}\right)\]
Pre-Modeling Assumptions: Penguin body mass is associated with penguin flipper length in a linear manner, independent of all other possible features.
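Post-Modeling Assumptions: The following assumptions are made about model errors (residuals), to ensure that using and interpreting the model is appropriate: the residuals are approximately normally distributed, are centered at 0, have a constant standard deviation, and show no pattern when plotted against the predictions or the predictor.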
Global Test for Model Utility: \(\begin{array}{lcl} H_0 & : & \beta_1 = 0\\ H_a & : &\beta_1 \neq 0\end{array}\)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.7646157 | 0.763689 | 398.8774 | 825.0863 | 0 | 1 | -1895.34 | 3796.68 | 3807.315 | 40412211 | 254 | 256 |
Individual Term-Based Assessment: \(\begin{array}{lcl} H_0 & : & \beta_1 = 0\\ H_a & : &\beta_1 \neq 0\end{array}\)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -5768.75714 | 348.042020 | -16.57489 | 0 |
flipper_length_mm | 49.67168 | 1.729255 | 28.72431 | 0 |
For Simple Linear Regression, the Global Test for Model Utility and the Term-based test have the same hypotheses and will have the same \(p\)-value. They are the same test here!
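As a quick check of this equivalence, here is a minimal base R sketch that uses only the values reported in the tables above. With a single predictor, the global \(F\) statistic should equal the square of the slope's \(t\) statistic, up to rounding in the printed values. The variable names are illustrative, not part of any package.

```r
# Values copied from the tables above
t_slope  <- 28.72431   # `statistic` for flipper_length_mm (term-based test)
f_global <- 825.0863   # `statistic` from the global test for model utility
df_resid <- 254        # `df.residual`

# With one predictor, F = t^2 (up to rounding in the printed values)
t_squared <- t_slope^2
t_squared
all.equal(t_squared, f_global, tolerance = 1e-5)

# Two-sided t-test p-value and F-test p-value for the same hypotheses
p_t <- 2 * pt(abs(t_slope), df = df_resid, lower.tail = FALSE)
p_f <- pf(t_squared, df1 = 1, df2 = df_resid, lower.tail = FALSE)
c(p_t = p_t, p_f = p_f)
```

Both \(p\)-values are far smaller than the display precision of the tables, which is why they are printed there as 0.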
metric | value |
---|---|
r.squared | 0.7646157 |
adj.r.squared | 0.7636890 |
sigma | 398.8774178 |
statistic | 825.0862647 |
p.value | 0.0000000 |
df | 1.0000000 |
logLik | -1895.3397992 |
AIC | 3796.6795984 |
BIC | 3807.3151307 |
deviance | 40412211.3860991 |
df.residual | 254.0000000 |
nobs | 256.0000000 |
\(R^2_{\text{adj}} \approx 76.4\%\), so approximately 76.4% of variation in penguin body mass is explained by variation in flipper length.
Training RMSE (`sigma`) is about 398.88, so we expect our model to predict penguin body mass to within \(\pm 2\cdot\left(398.88\right) \approx \pm 797.76\) grams.
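The metrics in the table are related to one another, and a short base R sketch can make those relationships concrete. It recomputes the adjusted \(R^2\) and `sigma` from the other reported values, then forms the two-standard-error band quoted above. The variable names are illustrative, and the inputs are simply copied from the table.

```r
# Values copied from the glance() output above
r2       <- 0.7646157   # r.squared
n        <- 256         # nobs
p        <- 1           # number of predictors (df)
rss      <- 40412211    # deviance (residual sum of squares for a linear model)
df_resid <- 254         # df.residual = n - p - 1

# Adjusted R^2 penalizes R^2 for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# sigma is the residual standard error: sqrt(RSS / df.residual)
sigma <- sqrt(rss / df_resid)

# Rough two-standard-error band for predictions, in grams
band <- 2 * sigma

round(c(adj_r2 = adj_r2, sigma = sigma, band = band), 4)
```

The recomputed values agree with the `adj.r.squared` and `sigma` entries in the table, so nothing new is being estimated here.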
\(\bigstar\) Let’s try it! \(\bigstar\)
Obtain the global model performance metrics for your model
Use the `glance()` function for this
Analyse and interpret the result
Obtain the individual term-based model assessment metrics
Use `extract_fit_engine()` here
Analyse and interpret the result
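For reference, the steps above might look like the following for the penguin example. This is a sketch under assumptions, not the exact code behind the slides: it assumes the data come from the `palmerpenguins` package, that rows with missing values are dropped, and that a 75/25 split is used. The seed and the object names (`penguin_split`, `train`, `lr_spec`, `lr_fit`) are arbitrary, so the numbers will not match the tables exactly.

```r
library(tidymodels)       # parsnip, rsample, broom, yardstick, ...
library(palmerpenguins)   # assumed data source for the penguin examples

# Assumed preparation: drop rows with missing values and split 75/25
set.seed(300)             # arbitrary seed
penguins_clean <- penguins %>% drop_na(flipper_length_mm, body_mass_g)
penguin_split  <- initial_split(penguins_clean, prop = 0.75)
train <- training(penguin_split)

# Specify and fit a simple linear regression on the training data
lr_spec <- linear_reg() %>% set_engine("lm")
lr_fit  <- lr_spec %>% fit(body_mass_g ~ flipper_length_mm, data = train)

# Global model performance metrics (R^2, adjusted R^2, sigma, F statistic, ...)
global_metrics <- lr_fit %>% glance()
global_metrics

# Individual term-based metrics (estimate, std.error, t statistic, p-value)
term_metrics <- lr_fit %>% extract_fit_engine() %>% tidy()
term_metrics
```

For your own model, swap in your training data, response, and predictor.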
.metric | .estimate |
---|---|
rmse | 380.8203026 |
rsq | 0.7405649 |
\(R^2 \approx 74.1\%\), so approximately 74.1% of variation in penguin body mass is explained by variation in flipper length when the model is evaluated on test data it did not see during fitting (removing the training data advantage).
Test RMSE is about 380.82, so we expect our model to predict penguin body mass to within \(\pm 2\cdot\left(380.82\right) \approx \pm 761.64\) grams.
The test \(R^2\) value is slightly worse than the corresponding training metric, while the test RMSE is an improvement over the training RMSE.
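Test metrics like the ones above could be computed along the following lines. As before, this is a sketch under assumptions: the `palmerpenguins` data, the arbitrary seed, the 75/25 split, and the object names are not taken from the slides, so the results will differ somewhat from the table.

```r
library(tidymodels)
library(palmerpenguins)   # assumed data source

# Assumed preparation (arbitrary seed, 75/25 split)
set.seed(300)
penguins_clean <- penguins %>% drop_na(flipper_length_mm, body_mass_g)
penguin_split  <- initial_split(penguins_clean, prop = 0.75)
train <- training(penguin_split)
test  <- testing(penguin_split)

# Fit on the training data only
lr_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ flipper_length_mm, data = train)

# Bundle the two metrics of interest, then evaluate them on the unseen test data
reg_metrics  <- metric_set(rmse, rsq)
test_metrics <- lr_fit %>%
  augment(new_data = test) %>%
  reg_metrics(truth = body_mass_g, estimate = .pred)
test_metrics
```

Because the test set is small, test metrics can land on either side of the training metrics from one split to the next.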
\(\bigstar\) Let’s try it! \(\bigstar\)
Compute your model's performance metrics (on the `test` set)
The residuals look approximately normal, with some right skew. There does seem to be an association between the residuals and the response, the predictions, and flipper length, though.
Patterns in residual plots indicate that we could make a better model.
\(\bigstar\) Let’s try it! \(\bigstar\)
Attach your model’s predictions to your training data
Compute a residuals (prediction errors) column
Visualize the distribution of residuals
Create a plot between residuals and the response
Create a plot between residuals and the predictions
Create a plot between the residuals and the predictor
Interpret the plots you’ve constructed
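The residual analysis steps above could be sketched as follows for the penguin example. The data source, seed, split, and object names are assumptions made for illustration. The residual column is named `residual` here to avoid clashing with any column that `augment()` may add itself.

```r
library(tidymodels)
library(palmerpenguins)   # assumed data source

# Assumed preparation (arbitrary seed, 75/25 split)
set.seed(300)
penguins_clean <- penguins %>% drop_na(flipper_length_mm, body_mass_g)
train <- training(initial_split(penguins_clean, prop = 0.75))

lr_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ flipper_length_mm, data = train)

# Attach predictions to the training data and compute prediction errors
train_resid <- lr_fit %>%
  augment(new_data = train) %>%
  mutate(residual = body_mass_g - .pred)

# Distribution of residuals
ggplot(train_resid, aes(x = residual)) +
  geom_histogram(bins = 30)

# Residuals against the response, the predictions, and the predictor
ggplot(train_resid, aes(x = body_mass_g, y = residual)) +
  geom_point() + geom_hline(yintercept = 0, linetype = "dashed")
ggplot(train_resid, aes(x = .pred, y = residual)) +
  geom_point() + geom_hline(yintercept = 0, linetype = "dashed")
ggplot(train_resid, aes(x = flipper_length_mm, y = residual)) +
  geom_point() + geom_hline(yintercept = 0, linetype = "dashed")
```

One caution when reading these plots: for a least-squares fit, training residuals are always positively correlated with the response (the correlation is \(\sqrt{1 - R^2}\)), so an upward trend in that plot alone is expected. Curvature or funnel shapes in the plots against the predictions or the predictor are the more informative signals.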
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -5768.75714 | 348.042020 | -16.57489 | 0 |
flipper_length_mm | 49.67168 | 1.729255 | 28.72431 | 0 |
\[\mathbb{E}\left[\text{body mass}\right] = -5768.76 + 49.67\cdot\left(\text{flipper length}\right)\]
Interpretations:
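(Intercept) We expect a penguin whose flipper length measures 0mm to have a mass of about -5768.76g. This is an extrapolation far outside the observed flipper lengths, so the intercept has no physical meaning on its own here.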
(Flipper Length) We expect a 1mm increase in flipper length to be associated with about a 49.67g increase in penguin body mass, on average.
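The slope interpretation can be checked numerically with the fitted equation above. In this small base R sketch, the helper `expected_mass()` is a hypothetical name introduced for illustration, and the flipper lengths 200mm and 210mm are arbitrary example values.

```r
# Rounded coefficients from the fitted model above
b0 <- -5768.76   # intercept (grams)
b1 <- 49.67      # slope (grams per mm of flipper length)

# Hypothetical helper: expected body mass (g) for a given flipper length (mm)
expected_mass <- function(flipper_mm) b0 + b1 * flipper_mm

# A 10mm difference in flipper length corresponds to 10 * b1 grams, on average
mass_diff <- expected_mass(210) - expected_mass(200)
mass_diff
```

The difference depends only on the slope: the intercept cancels, which is why the slope is the interpretable quantity here.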
\(\bigstar\) Let’s try it! \(\bigstar\)
Extract your model fit
Provide interpretations of the model coefficients
Include your discussions in your Quarto Notebook
Consider the following questions:
What is the body mass of a penguin whose flipper length is 212mm?
What is the average body mass of all penguins whose flipper length is 212mm?
The first question is asking about the mass of a single penguin, while the second question is asking about the average mass over all penguins with a particular characteristic.
There is more uncertainty associated with trying to answer the first question.
Our model predicts
\[\begin{align} \mathbb{E}\left[\text{body mass}\right] &= -5768.76 + 49.67\left(212\right)\\ &= 4761.28\text{g} \end{align}\]
as the answer to both, but we have 0% confidence that this single value is exactly right! Instead, we answer the first question with a prediction interval and the second with a confidence interval.
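The two questions about a 212mm flipper correspond to two different interval types, which could be computed along these lines. This is a sketch under assumptions: the `palmerpenguins` data, the arbitrary seed, the split, and the object names are illustrative, so the bounds will not match any particular slide. The intervals use the default 95% level.

```r
library(tidymodels)
library(palmerpenguins)   # assumed data source

# Assumed preparation (arbitrary seed, 75/25 split)
set.seed(300)
penguins_clean <- penguins %>% drop_na(flipper_length_mm, body_mass_g)
train <- training(initial_split(penguins_clean, prop = 0.75))

lr_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ flipper_length_mm, data = train)

new_penguin <- tibble(flipper_length_mm = 212)

# Point prediction: the same number answers both questions
point_pred <- lr_fit %>% predict(new_data = new_penguin)

# Confidence interval: average mass of all penguins with a 212mm flipper
conf_int <- lr_fit %>% predict(new_data = new_penguin, type = "conf_int")

# Prediction interval: mass of a single penguin with a 212mm flipper
pred_int <- lr_fit %>% predict(new_data = new_penguin, type = "pred_int")

bind_cols(new_penguin, point_pred, conf_int)
bind_cols(new_penguin, point_pred, pred_int)
```

The prediction interval is always the wider of the two, because it must cover the penguin-to-penguin variation in addition to the uncertainty in the estimated average.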
Let’s try this!
Write down two predictive questions associated with Airbnb rental prices that could be answered with your model
Can you differentiate versions of those questions that could be answered by confidence versus prediction intervals? What is the difference?
Compute your model’s prediction for a particular value of the predictor you chose
Similarly, compute the lower and upper bounds for both confidence and prediction intervals for your chosen value
(Challenge) Can you plot the result?
Interpret your model’s predictions