MAT 300 - Applied Statistics II: Regression Analysis
Syllabus (Fall 2024)
Course Description: This is a second course in statistics that builds upon knowledge gained in an introductory statistics course that covers statistical inference. Students will learn to build statistical models and develop skills for implementing regression analysis in real-world problems from engineering, sociology, psychology, science and business. Topics include multiple regression models (including first-order, second-order and interaction models with quantitative and qualitative variables), regression pitfalls, and residual analysis. Additional topics will be covered if time permits. Students will gain experience not only in the mechanics of regression analysis (often by means of a statistical software package) but also in deciding on appropriate models, selecting inferential techniques to answer a particular question, interpreting results and diagnosing problems.
Students in this course will use R, in particular the {tidyverse} and {tidymodels} ecosystems, to build and analyze regression models. The course covers simple and multiple linear regression, curvi-linear regression with polynomial and interaction terms, regularization with Ridge Regression and the LASSO, and tree-based models/ensembles. Cross-validation is implemented as an important technique for stable and unbiased model performance estimates, for identifying appropriate levels of model flexibility, and for hyperparameter tuning.
Course Timeline and Notebooks
Below is a tentative timeline for our course. It includes preparatory work that should be done prior to each class meeting, a detailed set of notes corresponding to each class meeting, and assignments following each class meeting. The prepared notebooks use the Palmer penguins [2] and Ames [1] housing datasets and are provided so that you have a detailed account of each topic we discuss. We’ll learn this content better by doing it than we will by simply reading and running pre-existing code, so we’ll plan to utilize different data in class. For now, I’m planning to start with this data set on rental properties in the San Francisco Bay Area posted to Craigslist [3], generously made open by Dr. Kate Pennington. We can switch to alternate data sets as student interest dictates. I’ve prepared the following student notes template (html, Quarto) that I hope you’ll use to follow along during our in-class discussions.
A Note on the Slide Decks: I built these slides to be displayed as a split-screen, alongside an open RStudio session. In this way, you can play along by building your own analysis with a different data set (or the same one, if you prefer). If you try displaying the slides across your full screen, the content will flow off the bottom of the page.
Class Meeting | Before Class | During Class | Slides | After Class |
---|---|---|---|---|
1 | Review Syllabus; Install R and RStudio | Introduction and What to Expect | Slide Version of Intro and Expectations | Finish Software Setup |
2 | Enroll in Competition; Read ISLR $\S$ 2.1 (Part I, Part II) | What is Statistical Learning?; Competition Discussion | Slide Version of Overview | What is an Analytics Report?; Competition Assignment 1; Analytics Report Shell (html, Quarto) |
3 | Read ISLR $\S$ 2.3 | Introduction to R: Enter the tidyverse (html, Quarto) | Companion Slides | |
4 | Read R4DS $\S$ 3.1 - 3.10 (Optional) | Data Viz and ggplot2 (html, Quarto) | Companion Slides | Competition Assignment 2 |
5 | | R Workshop Day: Quarto and R; Quarto Tips | | |
6 | | Data Wrangling Workshop (html, Quarto) | Companion Slides | Homework 1 (html, Quarto) |
7 | | Introduction to {tidymodels} (html, Quarto) | Companion Slides | |
8 | Intro Stats Review | Hypothesis Testing and Confidence / Prediction Intervals in Regression (html, Quarto) | Companion Slides | Homework 2 (html, Quarto) |
9 | Read ISLR $\S$ 3.1 (Part I, Part II) | Simple Linear Regression: Construction, Interpretation, and Model Assessment (html, Quarto) | Companion Slides | Competition Assignment 3 |
10 | Read ISLR $\S$ 3.2 (Part I, Part II) | Multiple Linear Regression: Construction, Interpretation, and Model Assessment (html, Quarto) | Companion Slides | |
11 | | Residual Analysis and Model Quality | Companion Slides | |
12 | Read ISLR $\S$ 3.3 (Part I, Part II) | Categorical Predictors and Interpretations; Feature Engineering with step_other() and step_dummy() (html, Quarto) | Companion Slides | Competition Assignment 4 |
13 | | Model Building, Assessment, and Interpretation Workshop | | |
14 | | Higher-Order Terms: Curvi-Linear Regression and Polynomial Terms with step_poly() (html, Quarto) | Companion Slides | Competition Assignment 5 |
15 | | Higher-Order Terms: Interaction with step_interact() (html, Quarto) | Companion Slides | |
16 | | Inference and Interpretation with {marginaleffects} | Companion Slides | |
17 | | Halloween Modeling Competition (In Class, 75 minutes) | | |
18 | Read ISLR $\S$ 2.2 | Bias/Variance Trade-Off and Model Performance Concerns (html, Quarto) | Companion Slides | |
19 | Read ISLR $\S$ 5.1 (Part I, Part II, Part III) | Performance Concerns Continued: Different Test, Different Expectations; Cross-Validation and Unbiased Model Performance (html, Quarto) | Companion Slides | Homework 3 |
20 | | Cross-Validation Workshop | | |
21 | Read ISLR $\S$ 6.1, 6.2 (Part IV, Part V, Part VI, Part VII) | Variable Selection Methods: Stepwise Regression, Ridge Regression, and the LASSO (html, Quarto) | Companion Slides | |
22 | | Other Regressors (html, Quarto) | Companion Slides | Competition Assignment 6 |
23 | | Hyperparameters and Tuning; More Uses for Cross-Validation (html, Quarto) | Companion Slides | |
24 | | Hyperparameters, Tuning, and Other Regressors Workshop | | |
25 | | Thanksgiving Modeling Competition (In Class, 75 minutes) | | |
26+ | Projects | Projects | Projects | Projects |
[1] De Cock, Dean (2011). Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education, 19(3). http://www.amstat.org/publications/jse/v19n3/decock.pdf
[2] Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. doi:10.5281/zenodo.3960218, R package version 0.1.0, https://allisonhorst.github.io/palmerpenguins/.
[3] Pennington, Kate (2018). Bay Area Craigslist Rental Housing Posts, 2000-2018. Retrieved from https://github.com/katepennington/historic_bay_area_craigslist_housing_posts/blob/master/clean_2000_2018.csv.zip.