Overview of Statistical Learning (and Competition Overview)

Dr. Gilbert

January 6, 2026

Comment: Confirmatory versus Exploratory Workflows

There are two different circumstances from which we can approach statistics and data projects.

  • Exploratory Settings: Where we don’t yet have well-defined expectations or formal hypotheses about associations, patterns, or relationships in our data.

    • The data itself will, at least in part, dictate the direction of our analyses.
  • Confirmatory Settings: Where we have generated formal hypotheses or perhaps even officially pre-registered them prior to collecting any data.

The scenario we are in dictates the workflow we must use in order to conduct valid inference.

Exploratory Settings and Workflows

In MAT300, we’ll often be working with data from contexts in which we don’t have deep subject-matter expertise.

Because of this, we’ll take an initial exploratory stance when working with our data.

  • We’ll use exploratory data analysis (EDA) to uncover patterns, associations, and relationships that may exist in the data.
  • We’ll use what we learn from that EDA to inform our modeling choices.

Generating hypotheses and then testing those hypotheses on the same observations compromises the validity of inference (see: snooping, fishing, p-hacking, etc.).

For this reason, in our course, we’ll split our data into a training (exploratory) set and at least one validation set.

We’ll conduct exploratory data analyses, generate hypotheses, and train models on the training data.

We’ll assess the performance of those models on the validation set(s).
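
A minimal sketch of what such a split might look like in R (the data frame `my_data` below is a simulated stand-in, not the course data):

```r
# Minimal sketch of a training/validation split in base R.
# `my_data` is a toy stand-in for whatever data set we are actually using.
set.seed(300)
my_data <- data.frame(x = runif(200, 0, 100))
my_data$y <- 3800 - 45 * my_data$x + rnorm(200, sd = 200)

n <- nrow(my_data)
train_rows <- sample(seq_len(n), size = floor(0.75 * n))  # ~75% of rows for training

train      <- my_data[train_rows, ]   # EDA, hypothesis generation, and model fitting happen here
validation <- my_data[-train_rows, ]  # held out to honestly assess model performance
```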

Confirmatory Settings and Workflows

It is also common to approach a statistics/data problem with pre-generated or pre-registered hypotheses.

These hypotheses are declared prior to any data collection and are justified by theory, prior experience, or justifiable expectations.

In these cases, the training and validation set approach is not necessary.

The investigator can simply proceed with modeling, model assessment, and interpretation using all of the available data.

They may not, however, change their hypotheses or adjust their model after fitting and analyzing the model corresponding to their initial hypotheses, and still treat the resulting inference as confirmatory.

Sometimes Splitting is Still Necessary: If model predictions are going to be used to inform decision-making, then data splitting is still necessary, even in the confirmatory setting. This is because predictive performance metrics generated from the data the model was trained on will be overly optimistic.

Important Takeaway Point: While we take an exploratory approach in MAT300, all of the model construction, model assessment (particularly significance testing), and model interpretation techniques we learn apply directly in confirmatory settings as well.

Statistical Learning in Pictures

  • Goal: Build a model \(\displaystyle{\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x}\) to predict \(y\), given \(x\).

  • Generalized Goal: Build a model \(\displaystyle{\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}\) to predict \(y\) given features \(x_1, \cdots, x_k\).

    • \(\beta_i\)’s are parameters, learned from training data.
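
As a preview, a model with several predictors might be specified in R roughly like this (the column names and simulated data below are placeholders, not the competition data):

```r
# Hypothetical sketch: specifying a multi-predictor linear model in R.
set.seed(1)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
toy$y <- 5 + 2 * toy$x1 - 1 * toy$x2 + 0.5 * toy$x3 + rnorm(100)

fit <- lm(y ~ x1 + x2 + x3, data = toy)  # estimates beta_0 through beta_3
coef(fit)                                # the parameters learned from the training data
```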

Statistical Learning in Pictures

  • This model doesn’t capture the general trend between our observed \(x\) and \(y\) pairs.

Statistical Learning in Pictures

  • Better job capturing the general trend (sort of).
  • Larger \(x\) values are associated with smaller \(y\) values.

Statistical Learning in Pictures

  • Always predicting too high!

    • We should overpredict sometimes and underpredict others. The average error should be \(0\).

Statistical Learning in Pictures

  • Capturing the general trend?

    • mostly
  • Balanced errors?

How it works

In this case, we have \(\mathbb{E}\left[y\right] = \beta_0 + \beta_1\cdot x\) and we find \(\beta_0\) (intercept) and \(\beta_1\) (slope) to minimize the quantity

\[\sum_{i = 1}^{n}{\left(y_{\text{obs}_i} - y_{\text{pred}_i}\right)^2}\]

Since each prediction is \(y_{\text{pred}_i} = \beta_0 + \beta_1\cdot x_{\text{obs}_i}\), this is the same as minimizing

\[\sum_{i = 1}^{n}{\left(y_{\text{obs}_i} - \left(\beta_0 + \beta_1\cdot x_{\text{obs}_i}\right)\right)^2}\]

  • Changing \(\beta_0\) and/or \(\beta_1\) will change this sum.
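
To make the criterion concrete, here is a small R sketch (with simulated stand-in data) showing that the sum changes as we change the candidate \(\beta_0\) and \(\beta_1\), and that `lm()` returns the pair that minimizes it:

```r
# Sketch: the sum of squared errors as a function of candidate beta values.
set.seed(1)
x <- runif(50, 0, 100)
y <- 3800 - 45 * x + rnorm(50, sd = 200)

sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)  # the quantity we want to minimize

sse(3800, -45)   # a reasonable guess -> relatively small sum
sse(0, 0)        # a poor guess      -> much larger sum

coef(lm(y ~ x))  # lm() finds the (b0, b1) pair that makes the sum as small as possible
```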

Interpreting the Model

\(\displaystyle{\mathbb{E}\left[y\right] = 3783.21 - 45.3\cdot x}\)

  • The expected value of \(y\) when \(x = 0\) is \(3783.21\).
  • As \(x\) increases, we expect \(y\) to decrease (on average).
  • Given a unit increase in \(x\), we expect \(y\) to decrease by about \(45.3\) units.

Approach to Model Interpretation: In general, we’ll interpret the intercept (when appropriate) and the expected effect of a unit change in each predictor on the response.
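
These interpretations can be read directly off the fitted equation; here is a quick numerical check in R using the slide’s coefficient values:

```r
# Sketch: checking the intercept and slope interpretations numerically.
b0 <- 3783.21   # intercept: the expected value of y when x = 0
b1 <- -45.3     # slope: the expected change in y per one-unit increase in x

expected_y <- function(x) b0 + b1 * x

expected_y(0)                      # 3783.21 -- the intercept interpretation
expected_y(11) - expected_y(10)    # -45.3   -- the unit-change interpretation
```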

Can we find a better model?

Can we interpret this model?

  • The equation is \(\mathbb{E}\left[y\right] \approx 1202 - 4911x + 3156x^2 + 784x^3 + 409x^4 - 215x^5 - 7x^6 - 516x^7\)
  • No thanks…

Do we expect this model to generalize well?

  • Especially near \(x = 0\) and \(x = 100\)

Is there a happy medium?

  • Fits old and new observations similarly well

  • Equation \(\displaystyle{\mathbb{E}\left[y\right] \approx 1202 -4912x + 3156x^2}\)

    • We’ll be able to interpret this
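
A rough R sketch of the contrast (simulated data standing in for the scatterplot): the degree-7 fit carries eight coefficients to juggle, while the quadratic has only three we can actually talk about.

```r
# Sketch: a very flexible fit versus the "happy medium" quadratic.
set.seed(2)
dat <- data.frame(x = runif(80, 0, 100))
dat$y <- 1200 - 49 * dat$x + 0.3 * dat$x^2 + rnorm(80, sd = 150)

wiggly <- lm(y ~ poly(x, 7), data = dat)      # eight parameters; hard to interpret
medium <- lm(y ~ x + I(x^2), data = dat)      # intercept, x, and x^2 terms only

coef(medium)   # three interpretable coefficients
```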

How do we know what model is right?

  • The purple model is too straight
  • The orange model is too wiggly
  • The green model is just right

How do we know what model is right?

  • We don’t want to wait for new data to know we are wrong.

    • Use some of our available data for training
    • And the rest for validation
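
One way to operationalize this in R (again with simulated stand-in data): fit each candidate model on the training rows and compare their errors on the held-out validation rows.

```r
# Sketch: comparing candidate models on a held-out validation set.
set.seed(3)
dat <- data.frame(x = runif(120, 0, 100))
dat$y <- 1200 - 49 * dat$x + 0.3 * dat$x^2 + rnorm(120, sd = 150)

train_rows <- sample(seq_len(nrow(dat)), size = 90)
train      <- dat[train_rows, ]
validation <- dat[-train_rows, ]

rmse <- function(model, newdata) {
  sqrt(mean((newdata$y - predict(model, newdata))^2))
}

straight <- lm(y ~ x, data = train)            # "too straight"
medium   <- lm(y ~ x + I(x^2), data = train)   # the candidate happy medium
wiggly   <- lm(y ~ poly(x, 7), data = train)   # "too wiggly"

sapply(list(straight = straight, medium = medium, wiggly = wiggly),
       rmse, newdata = validation)             # smallest validation RMSE is preferred
```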

Okay, but our predictions are all wrong…literally!

  • “All models are wrong, but some are useful” (George Box, 1976)
  • Predictions will be wrong but, with some assumptions, they have value

Necessary Assumptions

For model and predictions

  • Training data are random and representative of the population.

    • Otherwise, we should not be modeling this way.
  • Residuals (prediction errors) are normally distributed with mean \(\mu = 0\) and constant standard deviation \(\sigma\).

    • Allows construction of confidence intervals around predictions (making our models right).
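
Some quick diagnostic checks of these assumptions in R, for any model fit with `lm()` (simulated data used as a stand-in here):

```r
# Sketch: checking the residual assumptions for an lm() fit.
set.seed(4)
x <- runif(60, 0, 100)
y <- 3800 - 45 * x + rnorm(60, sd = 200)
fit <- lm(y ~ x)

res <- resid(fit)
mean(res)                          # essentially 0 by construction for least squares

plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals")  # look for constant spread around 0
abline(h = 0, lty = 2)

qqnorm(res); qqline(res)           # rough check of the normality assumption
```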

Necessary Assumptions

For interpretations of coefficients (statistical learning / inference)

  • No multicollinearity (predictors aren’t strongly correlated with one another)
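
A quick way to screen for this in R is to look at pairwise correlations among the predictors (hypothetical predictors simulated below); packages such as `car` also provide variance inflation factors.

```r
# Sketch: screening hypothetical predictors for strong pairwise correlation.
set.seed(5)
preds <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
preds$x3 <- preds$x1 + rnorm(100, sd = 0.1)   # x3 is nearly a copy of x1

round(cor(preds), 2)   # large off-diagonal entries flag potential multicollinearity
```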

Summary

  • Building models \(\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\)

  • Predicting a numerical response (\(y\)) given features (\(x_i\))

  • Need data to build the models – some for training, some for validation

    • The \(\beta_i\)’s are parameters whose values are learned/estimated from training data
  • Model predictions will be wrong

  • As long as the standard deviation of the residuals (prediction errors) is constant, we can build meaningful confidence intervals for predictions

  • Can interpret models to gain insight into relationships between predictor(s) and response

Competition Information

  • Predict bicycle rental duration for a city bike share program

  • Response: duration

  • Predictors: You have access to nearly 40 explanatory variables which could be useful in predicting the duration of a rental.

    • Date and time information
    • Origin of the rental
    • Station and city information
    • Weather-based features
    • Bicycle availability

How this works

  • Six assignments to guide you.
  • You’ll take data provided by the city bike share program and build models.
  • Kaggle will assess the predictions generated by your models.
  • A live leaderboard will let you know approximately where you stand, using a portion of the competition data.
  • You’ll talk with one another about modeling ideas and performance differences.
  • Some people might choose to share their strategies, while others might not – this is a competition, after all.

Clarification: All work submitted on these competitions must reflect your own analytical thinking and modeling decisions.

  • It is acceptable to ask peers for advice and to discuss what they did, but directly sharing model code is not permitted.
  • Use of AI tools is limited to debugging or correcting errors in code that you have written.
  • Using AI to generate models, analysis, or interpretations is inconsistent with the goals of these assignments, and is prohibited.

Connection to Debrief Meetings: Your submitted competition work will serve as part of the foundation for our debrief meetings. In these meetings, you will be expected to explain and justify your modeling choices, performance results, and interpretations. The purpose of these discussions is to ensure that the submitted work accurately reflects your understanding.

Why?

  • Interest in homework assignments generally ends after they’re turned in and graded.
  • The competition assignments and competitive environment ask you to iterate on previous work.
  • You’ll almost surely be interested in what other people have done, especially if their models have performed better than yours.
  • You’ll talk with one another about strategies and modeling choices.
  • You’ll be motivated to improve your model even between assignments.

What past students say

  • The competition is fun
  • It is motivating
  • I learned more because I wanted to place better in the competition
  • Talking with others about their models made me more confident in my understanding of course material

What you are building

  • An analytics report
  • You’ll be building models and (more importantly) writing about your modeling choices and the performance of your models
  • Six assignments – each focusing on part(s) of the modeling process and analytics report
  • Prepares you for the final project, where you’ll do all this over again on a data set you identify and care about

Next Time…

  • Way fewer slides! 🤕
  • An introduction to R
  • Getting our hands dirty!

Homework: Start Competition Assignment 1 – join the competition, read the details, download the data, and start writing a Statement of Purpose