September 6, 2024
Goal: Build a model \(\displaystyle{\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x}\) to predict \(y\), given \(x\).
Generalized Goal: Build a model \(\displaystyle{\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}\) to predict \(y\) given features \(x_1, \cdots, x_k\).
Always predicting too high!
Capturing the general trend?
Balanced errors?
In this case, we have \(\mathbb{E}\left[y\right] = \beta_0 + \beta_1\cdot x\) and we find \(\beta_0\) (intercept) and \(\beta_1\) (slope) to minimize the quantity
\[\sum_{i = 1}^{n}{\left(y_{\text{obs}_i} - y_{\text{pred}_i}\right)^2}\]
Equivalently, writing each prediction as \(y_{\text{pred}_i} = \beta_0 + \beta_1\cdot x_{\text{obs}_i}\), we find \(\beta_0\) (intercept) and \(\beta_1\) (slope) to minimize the quantity
\[\sum_{i = 1}^{n}{\left(y_{\text{obs}_i} - \left(\beta_0 + \beta_1\cdot x_{\text{obs}_i}\right)\right)^2}\]
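The minimization above has a well-known closed-form solution for a single predictor. The sketch below is only an illustration: the function names `fit_simple_ols` and `sum_squared_errors` and the five data points are made up for the example and do not come from the course data.

```python
def fit_simple_ols(xs, ys):
    """Return (beta0, beta1) that minimize the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Spread of x around its mean, and co-movement of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta1 = sxy / sxx              # slope
    beta0 = y_bar - beta1 * x_bar  # intercept
    return beta0, beta1


def sum_squared_errors(xs, ys, beta0, beta1):
    """The quantity being minimized: sum of (observed - predicted)^2."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data, invented for this illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0, beta1 = fit_simple_ols(xs, ys)
best = sum_squared_errors(xs, ys, beta0, beta1)
# Any other slope should give a larger sum of squared errors.
nudged = sum_squared_errors(xs, ys, beta0, beta1 + 0.1)

print(f"intercept = {beta0:.3f}, slope = {beta1:.3f}")
print(f"SSE at fit = {best:.4f}, SSE with nudged slope = {nudged:.4f}")
```

Nudging the slope away from the fitted value raises the sum of squared errors, which is what "minimize" means here.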
\(\displaystyle{\mathbb{E}\left[y\right] = 3783.21 - 45.3\cdot x}\)
Approach to Model Interpretation: In general, we’ll interpret the intercept (when appropriate) and the expected effect of a unit change in each predictor on the response, holding the other predictors fixed.
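As a worked illustration, take the fitted model \(\mathbb{E}\left[y\right] = 3783.21 - 45.3\cdot x\) that appears earlier in these notes. The variables and their units are not stated here, so the reading below stays generic.

\[\mathbb{E}\left[y \mid x + 1\right] - \mathbb{E}\left[y \mid x\right] = \left(3783.21 - 45.3\cdot(x + 1)\right) - \left(3783.21 - 45.3\cdot x\right) = -45.3\]

So each one-unit increase in \(x\) is associated with an expected decrease of about 45.3 units in \(y\). The intercept, 3783.21, is the expected response when \(x = 0\), which is only worth interpreting if \(x = 0\) is a plausible value.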
Fits old and new observations similarly well
Equation: \(\displaystyle{\mathbb{E}\left[y\right] \approx 1202 - 4912x + 3156x^2}\)
We don’t want to wait for new data to know we are wrong.
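One common way to avoid waiting for new data is to hold some of the available observations back and treat them as "new". The sketch below assumes a simple random 75/25 split and simulated data. The helper names (`fit_line`, `rmse`), the split fraction, and the data-generating line are all assumptions made for the illustration.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    return y_bar - beta1 * x_bar, beta1


def rmse(xs, ys, beta0, beta1):
    """Root mean squared prediction error on the given observations."""
    sse = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / len(xs))


# Simulated data (hypothetical): a straight line plus random noise.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# Randomly assign 75% of the rows to training and the rest to validation.
indices = list(range(len(xs)))
rng.shuffle(indices)
cut = int(0.75 * len(indices))
train_idx, valid_idx = indices[:cut], indices[cut:]

train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
valid_x = [xs[i] for i in valid_idx]
valid_y = [ys[i] for i in valid_idx]

# Fit on training rows only, then score on both sets.
beta0, beta1 = fit_line(train_x, train_y)
train_rmse = rmse(train_x, train_y, beta0, beta1)
valid_rmse = rmse(valid_x, valid_y, beta0, beta1)

print(f"training RMSE:   {train_rmse:.3f}")
print(f"validation RMSE: {valid_rmse:.3f}")
```

If the two error values are similar, the model fits old and new observations similarly well. A validation error much larger than the training error is a warning sign.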
For model and predictions
Training data are random and representative of population.
Residuals (prediction errors) are normally distributed with mean \(\mu = 0\) and constant standard deviation \(\sigma\).
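A rough numerical check of the residual assumptions might look like the following. This is a sketch on simulated data, not a formal test: the helper `fit_line`, the simulated line, and the choice to compare the lower and upper halves of \(x\) are all assumptions made for the illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    return y_bar - beta1 * x_bar, beta1


# Simulated data (hypothetical), already sorted by x.
rng = random.Random(1)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

beta0, beta1 = fit_line(xs, ys)
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]

# On the training data, a least-squares fit with an intercept forces the
# residual mean to zero, so this check is more informative on held-out data.
mean_resid = statistics.fmean(residuals)

# Compare residual spread for small x versus large x; a ratio far from 1
# suggests the standard deviation is not constant.
half = len(xs) // 2
sd_low = statistics.stdev(residuals[:half])
sd_high = statistics.stdev(residuals[half:])
spread_ratio = sd_high / sd_low

print(f"mean residual: {mean_resid:.6f}")
print(f"residual sd (low x): {sd_low:.3f}, (high x): {sd_high:.3f}")
print(f"spread ratio: {spread_ratio:.3f}")
```

In practice a plot of residuals against predictions, along with a histogram of the residuals, gives a fuller picture than these two summaries.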
For interpretations of coefficients (statistical learning / inference)
Building models \(\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\)
Predicting a numerical response (\(y\)) given features (\(x_i\))
Need data to build the models – some for training, some for validation
Model predictions will be wrong
As long as standard deviation of residuals (prediction errors) is constant, we can build meaningful confidence intervals for predictions
Can interpret models to gain insight into relationships between predictor(s) and response
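To make the point about intervals concrete, the sketch below builds a rough interval of "prediction plus or minus two residual standard deviations". This is a simplification that ignores uncertainty in the estimated coefficients, so it is not the exact textbook interval. The data are simulated, and the names `fit_line` and `rough_interval` are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    return y_bar - beta1 * x_bar, beta1


# Simulated data (hypothetical) with constant noise level.
rng = random.Random(2)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]
n = len(xs)

beta0, beta1 = fit_line(xs, ys)
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]

# Residual standard deviation; n - 2 because two coefficients were estimated.
residual_sd = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))


def rough_interval(x_new):
    """Approximate interval: prediction +/- 2 residual standard deviations."""
    center = beta0 + beta1 * x_new
    return center - 2 * residual_sd, center + 2 * residual_sd


# Share of observed responses that fall inside their rough interval.
covered = 0
for x, y in zip(xs, ys):
    low, high = rough_interval(x)
    if low <= y <= high:
        covered += 1
coverage = covered / n

low, high = rough_interval(5.0)
print(f"rough interval at x = 5.0: ({low:.2f}, {high:.2f})")
print(f"share of observations covered: {coverage:.2f}")
```

The same interval width is used at every \(x\), which is only sensible when the residual standard deviation really is constant.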
Predict DoorDash delivery times
Response: delivery time
Predictors: market_id, order_time, delivery_time, store_id, cuisine_type, order_protocol, items_in_order, subtotal_cost, distinct_items_in_order, min_item_price, max_item_price, dashers_working, busy_dashers, outstanding_orders, model_1_estimate, model_2_estimate
Homework: Start Competition Assignment 1 – join the competition, read the details, download the data, and start writing a Statement of Purpose