MAT 434 - Statistical Learning and Classification (with R)
 Spring 2025 Syllabus
Course Description: Using the foundational knowledge built in MAT 240/241 and MAT 300, we continue our study of statistical models. This course moves beyond regression and into classification models, mixed models, and unsupervised learning. Like MAT 300, this course also emphasizes cross-validation as an important method for hyperparameter tuning, identifying appropriate levels of model flexibility, approximating future model performance, and analyzing the utility of a model. This course covers logistic regression, support vector machines, k nearest neighbors, tree-based methods (bagging, boosting, and random forests), and neural networks. We also cover techniques for dimension reduction and working with text-based features. In addition to the statistical modeling coursework, students will be exposed to GitHub for collaboration and version control and will use GitHub Pages to build and populate a professional profile for sharing their work on the web.
Course Timeline and Notebooks
Below is a tentative timeline for our course. The table includes preparatory work that should be completed prior to each class meeting, a description of what to expect during our class meeting, and assignments following each class meeting. I’m taking a more free-form approach to MAT 434 than we took in MAT 300, where I provided you with detailed notebooks or slides prior to each class meeting. In MAT 434, we’ll be building our notebooks in class, exploring several different data sets as our semester goes on. While the main topics for each class meeting are determined, you all will be dictating the direction of our analyses, the choices we make during model construction, and the corresponding discussions we end up having.
| Class Meeting | Dataset | Before Class | During Class | After Class | 
|---|---|---|---|---|
| 1 | | i) Review Syllabus ii) Software Setup | Day 1 Slides i) Introduction and What to Expect ii) Ethics and Data Models | HW 1 |
| 2 | FAA Airstrikes and Engine Damage, or MLB Hits and Homeruns | i) Ensure that git is working from RStudio | Day 2 As Slides i) R Projects and Version Control ii) Tidy Analyses in R (new students or returning students) | HW 2, CA 1 |
| 3 | | | Quarto, inline commands, and semi-automated reporting | HW 3 |
| 4 | | | EDA and Data Viz (InClass DataViz Slides) | CA 2 |
| 5 | | | GitHub Pages and a public-facing portfolio | HW 4 |
| 6 | | {tidymodels} Framework (Review) | {tidymodels} Framework Example | |
| 7 | | Regression Versus Classification and Performance Metrics for Classifiers (html or qmd) | | HW 5 |
| 8 | Spaceship Titanic | Intro to Logistic Regressors (html or qmd) | Binary Classifiers, Part I: Logistic Regression | |
| 9 | | Intro to Support Vector Classifiers (html or qmd) | Binary Classifiers, Part II: Support Vector Machines | HW 6 |
| 10 | Gene Expression and Cancerous Tumors | Intro to Principal Component Analysis (html or qmd) | Aside: High-Dimensional Data and Dimension Reduction | |
| 11 | Healthcare Analytics: Length of Stay (smaller) | Intro to k Nearest Neighbors (html or qmd) | Multiclass Classifiers, Part I: Nearest Neighbors | CA 3 | 
| 12 | | Intro to Decision Trees (html or qmd) | Multiclass Classifiers, Part II: Decision Tree Classifiers | |
| 13 | | Intro to Ensembles, Bagging, and Random Forests (html or qmd) | Ensembles, Part I: Bagging and Random Forests | |
| 14 | | Intro to Boosting (html or qmd) | Ensembles, Part II: Boosting | CA 4 |
| 15 | | | Work on GitHub Page | |
| 16 | Cyberbullying | Intro to Text and Tokenization (html or qmd) | Text Features, Part I: Tokenization | |
| 17 | | Intro to Regular Expressions (html or qmd) | Text Features, Part II: Regex | |
| 18 | St. Patrick’s Day Competition | | St. Patrick’s Day Classification Challenge (In-Class Kaggle Competition) | |
| 19 | | Intro to Word Embeddings (html or qmd) | Text Features, Part III: Embeddings | CA 5 |
| 20 | Fashion MNIST | Install TensorFlow | Deep Learning, Part I: Architecture | |
| 21 | | | Deep Learning, Part II: Activation Functions | |
| 22 | | | Deep Learning, Part III: Training and Assessment | CA 6 |
| 23+ | | | Final Projects | |
References
[1] James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning (2021)
[2] MLB Hits and Homeruns data set taken from Sliced Data Science Competition (Season 1, Episode 9)
[3] Airstrikes and Engine Damage data set posted to Kaggle by the FAA and Abigail Larion in 2016
[4] Spaceship Titanic data set taken from a Kaggle Getting Started competition
[5] Gene Expression and Cancerous Tumors data set retrieved from Synapse.org and maintained by the Cancer Genome Project
[6] Cyberbullying data set posted to Kaggle by user LARXEL in 2021