Topic 19: Linear Regression (Lab)
This lab introduces simple and multiple linear regression. We’ll analyze data from a study on student course evaluations at the University of Texas at Austin, exploring how instructor and course characteristics relate to evaluation scores.
This is a derivative of a product of OpenIntro released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted from a lab written by Mine Çetinkaya-Rundel and Andrew Bray.
An Introduction to Linear Regression
Consider the inference tasks we’ve worked through so far. We’ve compared numerical or categorical variables across one or two populations, and extended that to three or more groups with ANOVA and \(\chi^2\). In those cases, the grouping variable was always categorical. Now we ask: what if both variables are numerical? Can we ask whether there is an association between a numerical \(X\) and a numerical \(Y\)? The answer is yes — and the technique is called linear regression.
Let’s check in with a few short videos from OpenIntro to develop the idea.
Simple linear regression uses a single numerical predictor to predict a numerical response. The model takes the form of a straight line:
\[\mathbb{E}[y] = \beta_0 + \beta_1 x\]
where \(\beta_0\) is the intercept and \(\beta_1\) is the slope. The full model includes an error term \(\varepsilon\) representing unexplained noise, but since we assume \(\varepsilon \sim N(0, \sigma)\), we typically write the model in terms of the expected (average) response. Regression models are most reliable for interpolation — making predictions within the range of observed predictor values — and should be used with caution for extrapolation beyond that range.
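As a quick illustration of these ideas, here is a minimal sketch using simulated data (not the lab's dataset): fit a line with lm() and make predictions only within the observed range of the predictor. All values and names here are illustrative.

```r
# Minimal sketch with simulated data (not part of the lab).
set.seed(1)
x <- runif(50, min = 0, max = 10)           # predictor observed between 0 and 10
y <- 2 + 0.5 * x + rnorm(50, sd = 1)        # true line: beta0 = 2, beta1 = 0.5, plus noise
fit <- lm(y ~ x)                            # estimate the intercept and slope
coef(fit)                                   # estimated beta0 and beta1
predict(fit, newdata = data.frame(x = 5))   # interpolation: x = 5 lies inside [0, 10]
# predict(fit, newdata = data.frame(x = 50))  # extrapolation: far outside the data; use caution
```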
The Data
Many college courses conclude by giving students the opportunity to evaluate the course and instructor anonymously. However, the use of these evaluations as indicators of teaching effectiveness is often criticized because they may reflect non-teaching-related characteristics such as the physical appearance of the instructor. The article “Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity” (Hamermesh and Parker, 2005) found that instructors perceived as more attractive receive higher instructional ratings.
In this lab we analyze data from that study to understand what goes into a positive professor evaluation. The data were gathered from end-of-semester student evaluations for a large sample of professors at the University of Texas at Austin. Six students also rated each professor’s physical appearance. Each row of the dataset represents a different course.
The evals dataset contains the following variables:
| Variable | Description |
|---|---|
| `score` | average professor evaluation score: (1) very unsatisfactory – (5) excellent |
| `rank` | rank of professor: teaching, tenure track, tenured |
| `ethnicity` | ethnicity of professor: not minority, minority |
| `gender` | gender of professor: female, male |
| `language` | language of school where professor received education: english or non-english |
| `age` | age of professor |
| `cls_perc_eval` | percent of students in class who completed evaluation |
| `cls_did_eval` | number of students in class who completed evaluation |
| `cls_students` | total number of students in class |
| `cls_level` | class level: lower, upper |
| `cls_profs` | number of professors teaching sections in course in sample: single, multiple |
| `cls_credits` | number of credits of class: one credit (lab, PE, etc.), multi credit |
| `bty_f1lower` | beauty rating from lower-level female student: (1) lowest – (10) highest |
| `bty_f1upper` | beauty rating from upper-level female student |
| `bty_f2upper` | beauty rating from second upper-level female student |
| `bty_m1lower` | beauty rating from lower-level male student |
| `bty_m1upper` | beauty rating from upper-level male student |
| `bty_m2upper` | beauty rating from second upper-level male student |
| `bty_avg` | average beauty rating of professor |
| `pic_outfit` | outfit of professor in picture: not formal, formal |
| `pic_color` | color of professor’s picture: color, black & white |
What is the difference between an observational study and an experiment?
Is this an observational study or an experiment?
The original research question asks whether beauty leads directly to differences in course evaluations. Given the study design, is it possible to answer this question as phrased?
Exploratory Analysis
Use the code block below to draw a histogram of the score variable in the evals data frame. Include the following labels for the plot:
labs(
  title = "Distribution of Course Evaluation Scores",
  x = "Score",
  y = ""
)

Pipe the evals data frame into ggplot().
Add a geom_histogram() layer. For a histogram, only map x — the heights of the bars are computed automatically from the data.
evals |>
  ggplot() +
  geom_histogram(aes(x = ___))

Don’t forget to add the labels.
evals |>
  ggplot() +
  geom_histogram(aes(x = score)) +
  labs(
    title = "Distribution of Course Evaluation Scores",
    x = "Score",
    y = ""
  )
Describe the distribution of score.
What does this tell you about how students typically rate courses?
Use the code block below to explore relationships between other variables in the evals data frame. Try grouped summaries and additional plots as you see fit.
Try group_by() and summarize() to compare numerical summaries across groups, or use ggplot() to construct boxplots, scatterplots, or histograms for variables that interest you.
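For instance, one possible exploration (the variables chosen here are just an example; any pairing that interests you is fine) compares evaluation scores across professor rank:

```r
# Compare average evaluation score across professor rank (illustrative choice).
evals |>
  group_by(rank) |>
  summarize(
    mean_score = mean(score),
    n = n()
  )

# A boxplot of the same comparison.
evals |>
  ggplot() +
  geom_boxplot(aes(x = rank, y = score)) +
  labs(
    title = "Evaluation Score by Professor Rank",
    x = "Rank",
    y = "Course Evaluation Score"
  )
```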
Simple Linear Regression
The fundamental phenomenon suggested by Hamermesh and Parker is that better-looking instructors receive higher evaluation scores. Create a scatterplot with bty_avg on the horizontal axis and score on the vertical axis to see whether this appears to be the case. Include the following labels:
- Title: Association between Attractiveness and Evaluation Score
- x-axis: Average Beauty Rating
- y-axis: Course Evaluation Score
Pipe evals into ggplot() and use geom_point() for a scatterplot. You’ll need both x and y aesthetic mappings.
Since the researchers suspect attractiveness impacts evaluation score, use bty_avg as x and score as y.
evals |>
  ggplot() +
  geom_point(aes(x = ___, y = ___))

evals |>
  ggplot() +
  geom_point(aes(x = bty_avg, y = score)) +
  labs(
    title = "Association between Attractiveness and Evaluation Score",
    x = "Average Beauty Rating",
    y = "Course Evaluation Score"
  )
Before drawing conclusions, compare the number of observations in evals with the number of visible points in your scatterplot. Does something seem off? Use the code block below to replot using geom_jitter() instead of geom_point(). What was misleading about the original scatterplot?
Copy the code from the previous scatterplot and replace geom_point() with geom_jitter(). The geom_jitter() layer adds a small amount of random noise to each point’s position to prevent overplotting.
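To check your work, the jittered version should look something like the sketch below; the labels simply mirror the earlier plot.

```r
evals |>
  ggplot() +
  geom_jitter(aes(x = bty_avg, y = score)) +  # jitter reveals overplotted points
  labs(
    title = "Association between Attractiveness and Evaluation Score",
    x = "Average Beauty Rating",
    y = "Course Evaluation Score"
  )
```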
Now let’s fit a linear model. The code block below builds a simple linear regression model predicting evaluation score from average beauty rating. Run it and use the summary output to answer the questions that follow.
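The fitting code is not reproduced here, but it presumably looks something like the sketch below; the object name m_bty is an assumption, and the lab's code block may use a different name.

```r
# Fit a simple linear regression of evaluation score on average beauty rating.
# The model name m_bty is a placeholder.
m_bty <- lm(score ~ bty_avg, data = evals)
summary(m_bty)
```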
Is average beauty rating a statistically significant predictor of overall evaluation score?
What is the approximate value of the intercept?
What is the slope of the regression model with respect to average beauty rating?
Use the code block below to add the regression line to your jittered scatterplot using geom_abline() with the slope and intercept from the model summary.
Start with the jittered scatterplot you produced earlier and add a geom_abline() layer.
The slope and intercept arguments in geom_abline() are not mapped from data columns, so they go outside of any aes() call.
evals |>
  ggplot() +
  geom_jitter(aes(x = bty_avg, y = score)) +
  geom_abline(slope = ___, intercept = ___) +
  labs(
    title = "Association between Attractiveness and Evaluation Score",
    x = "Average Beauty Rating",
    y = "Course Evaluation Score"
  )

Use the slope and intercept values from the model summary.
evals |>
  ggplot() +
  geom_jitter(aes(x = bty_avg, y = score)) +
  geom_abline(slope = 0.06664, intercept = 3.88034) +
  labs(
    title = "Association between Attractiveness and Evaluation Score",
    x = "Average Beauty Rating",
    y = "Course Evaluation Score"
  )

Write out the equation for the linear model and interpret the slope in context.
From the plot, does average beauty rating seem to be a practically significant predictor of evaluation score?
What does it mean that average beauty rating is a statistically significant but not practically significant predictor of evaluation score?
Multiple Linear Regression
We now expand the model to include additional predictors so we can better understand which instructor and course characteristics, on average, lead to the highest evaluation scores. We’ll start with a large model and then use backward elimination — removing the least significant predictor one at a time — until all remaining predictors are statistically significant.
First, let’s check in with Dr. Çetinkaya-Rundel again for a brief introduction to multiple regression.
Multiple regression generalizes simple regression by allowing more than one predictor:
\[\mathbb{E}[y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\]
Predictors may be numerical (each coefficient is a slope, describing the expected change in the response per one-unit increase in that predictor, holding all others constant) or categorical (each coefficient is a shift in the intercept relative to a reference level).
Model quality is often measured with adjusted \(R^2\), which captures the proportion of variability in the response explained by the model, while penalizing unnecessary complexity. Values closer to 1 are better.
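In R, adjusted \(R^2\) appears near the bottom of the summary() output for a fitted model; it can also be extracted directly, as in this sketch (the object name fit is a placeholder).

```r
# Adjusted R-squared of a fitted lm object (object name is a placeholder).
fit <- lm(score ~ bty_avg, data = evals)
summary(fit)$adj.r.squared
```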
You’ll notice that rather than including all six beauty rating variables, we include only bty_avg. This is because the individual beauty ratings are highly correlated with one another — they encode essentially the same information. Including highly correlated predictors can cause problems for regression. Look for a full course in regression analysis to learn more.
Run the code block below to build and inspect the full model.
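The full model's code is not reproduced here; based on the reduced model shown later in the lab, it presumably includes every candidate predictor (including cls_profs), so treat the exact formula below, and the object name m_full, as assumptions.

```r
# Full model with all candidate predictors (reconstructed; the lab's block may differ).
m_full <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval +
               cls_students + cls_level + cls_profs + cls_credits +
               bty_avg + pic_color + pic_outfit,
             data = evals)
summary(m_full)
```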
Using \(\alpha = 0.05\), which predictor variables in the full model are not statistically significant? Select all that apply.
How many predictor variables should be removed before re-running the model?
When a predictor is removed, all of the remaining coefficients, standard errors, and \(p\)-values change. A predictor that appeared insignificant in the full model may become significant once another predictor is removed. Backward elimination removes predictors one at a time — always dropping the one with the highest \(p\)-value — to account for this.
Use the code block below to remove the predictor with the highest \(p\)-value from the full model, re-run it, and inspect the summary.
Which predictor had the highest \(p\)-value in the full model? Delete it from the formula in the lm() call.
cls_profs had the highest \(p\)-value in the full model. Remove it from the formula.
m_reduce1 <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval +
cls_students + cls_level + cls_credits +
bty_avg + pic_color + pic_outfit,
data = evals)
summary(m_reduce1)
Continue the backward elimination process in the code block below, removing one predictor at a time until all remaining predictors are statistically significant.
Start with the model from the previous code block. Identify the predictor with the highest \(p\)-value, remove it, re-run, and inspect the output. Repeat until all predictors are significant.
After removing cls_profs, the next candidates to check are cls_students, cls_level, ethnicity, and pic_outfit. Which has the highest \(p\)-value?
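Each step of the process looks like the sketch below; which predictor to drop depends on the summary output at that step, so the one named here (cls_level) is purely a placeholder, not the confirmed answer.

```r
# Drop whichever remaining predictor has the highest p-value and refit.
# cls_level is named here only as a placeholder.
m_reduce2 <- update(m_reduce1, . ~ . - cls_level)
summary(m_reduce2)
```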
Once you’ve identified the final model, use the code block below to build it as m_final and print the summary.
Copy and paste the final version of your model from the previous code block, rename it m_final, and run summary(m_final).
m_final <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval +
cls_credits + bty_avg + pic_color,
data = evals)
summary(m_final)
Based on the final model, select the characteristics associated with a higher predicted evaluation score. Select all that apply.
As with the simple regression model, we’ve identified several statistically significant predictors of course evaluation score — but statistical significance does not imply practical significance.
What we have replicated is the core finding of Hamermesh and Parker: there are meaningful implicit biases embedded in the instrument used to measure teaching quality. The fact that instructor ethnicity, gender, language background, attractiveness, and picture format all have explanatory value in predicting evaluation scores raises serious questions about what these evaluations actually measure.
If you found this lab interesting, linear regression is just the beginning of a powerful family of statistical modeling techniques. Look for full courses in regression analysis, predictive modeling, statistical learning, or machine learning to go deeper.
Submit
If you are part of a course with an instructor who is grading your work on these activities, please copy and submit both of the hashes below using the method your instructor has requested.
The hash below encodes your responses to the multiple choice and checkbox questions in this activity.
Click the button below to generate your exercise submission code. This hash encodes your work on the graded code exercises in this activity.
You must have attempted the graded exercises before clicking — clicking generates a snapshot of your current results. If you have completed the activity over multiple sessions, please go back through and hit the Run Code button on each graded exercise before generating the hash below, to ensure your most recent results are recorded.
Summary
- Simple linear regression models the relationship between a single numerical predictor \(x\) and a numerical response \(y\) using a straight line: \(\mathbb{E}[y] = \beta_0 + \beta_1 x\). The slope \(\beta_1\) describes the expected change in \(y\) per one-unit increase in \(x\).
- Statistical significance and practical significance are not the same thing. A predictor can be statistically significant — meaning the data provide evidence that the true coefficient is nonzero — while explaining very little of the variability in the response and yielding poor predictions.
- Multiple linear regression extends the simple model to include multiple predictors, which may be numerical or categorical. Each numerical predictor’s coefficient is a slope; categorical predictor coefficients shift the intercept.
- Backward elimination is one strategy for building a parsimonious model: start with all predictors, remove the least significant one at a time, and reassess after each removal. Removing only one predictor at a time is important because all coefficients and \(p\)-values change when a predictor is dropped.
- Observational data cannot establish causation. The finding that beauty rating predicts evaluation score does not mean that attractiveness causes higher ratings — it means there is an association. An experiment with random assignment would be needed to establish causality.
This lab completes the first pass through the full introductory statistics curriculum. If you’ve completed all of these activities, then you’ve traveled from data types and sampling all the way through linear regression — building a coherent framework for asking questions with data, quantifying uncertainty, and drawing defensible conclusions. The tools you’ve learned here are the foundation for more advanced work in regression modeling, machine learning, causal inference, and beyond. Well done!
I’m planning to develop similar series of activities for other courses. I hope you’ll check back in to see if I’ve got anything else that can be useful to you.