January 6, 2026
In MAT300, we’ll often be working with data from contexts in which we don’t have deep subject-matter expertise.
Because of this, we’ll take an initial exploratory stance when working with our data.
Generating hypotheses and then testing those hypotheses on the same observations compromises the validity of inference (see: snooping, fishing, p-hacking, etc.).
For this reason, in our course, we’ll split our data into a training (exploratory) set and at least one validation set.
We’ll conduct exploratory data analyses, generate hypotheses, and train models on the training data.
We’ll assess the performance of those models on the validation set(s).
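For concreteness, a split along these lines could be done as in the sketch below. This is only an illustration: the DataFrame name `df`, the 25% validation fraction, and the random seed are placeholders, not course-provided objects or requirements.

```python
# Minimal sketch of a train/validation split; `df` is a placeholder for a
# pandas DataFrame holding the full data set, and the 25% split is illustrative.
from sklearn.model_selection import train_test_split

train, validation = train_test_split(df, test_size=0.25, random_state=300)
# Explore, generate hypotheses, and fit models using only `train`;
# hold out `validation` for assessing those models.
```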
It is also common to approach a statistics/data problem with pre-generated or pre-registered hypotheses.
These hypotheses are declared prior to any data collection and are justified by theory, prior experience, or justifiable expectations.
In these cases, the training and validation set approach is not necessary.
The investigator can simply proceed with modeling, model assessment, and interpretation using all of the available data.
If, however, they change their hypotheses or adjust their model after fitting and analyzing the one corresponding to their initial hypotheses, the resulting inference can no longer be treated as confirmatory.
Sometimes Splitting is Still Necessary: If model predictions are going to be used to inform decision-making, then data splitting is still necessary, even in the confirmatory setting. This is because predictive performance metrics generated from the data the model was trained on will be overly optimistic.
Important Takeaway Point: While we take an exploratory approach in MAT300, all of the model construction, model assessment (particularly significance testing), and model interpretation techniques we learn apply directly in confirmatory settings as well.

Goal: Build a model \(\displaystyle{\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x}\) to predict \(y\), given \(x\).
Generalized Goal: Build a model \(\displaystyle{\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}\) to predict \(y\) given features \(x_1, \cdots, x_k\).
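As an illustration of fitting such a model, the sketch below uses the `statsmodels` formula interface in Python; the training DataFrame `train` and the column names `y`, `x1`, and `x2` are placeholders rather than names from the course data.

```python
# Sketch: fit E[y] = beta_0 + beta_1*x1 + beta_2*x2 on the training data.
# The names `train`, y, x1, x2 are placeholders for an actual data set.
import statsmodels.formula.api as smf

fit = smf.ols("y ~ x1 + x2", data=train).fit()
print(fit.params)     # estimated intercept and slopes
print(fit.summary())  # coefficient table, standard errors, significance tests
```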


(Figures: candidate models — one always predicting too high, one capturing the general trend, one with balanced errors.)
In this case, we have \(\mathbb{E}\left[y\right] = \beta_0 + \beta_1\cdot x\) and we find \(\beta_0\) (intercept) and \(\beta_1\) (slope) to minimize the quantity
\[\sum_{i = 1}^{n}{\left(y_{\text{obs}_i} - y_{\text{pred}_i}\right)^2}\]
Equivalently, since \(y_{\text{pred}_i} = \beta_0 + \beta_1\cdot x_{\text{obs}_i}\), we choose \(\beta_0\) (intercept) and \(\beta_1\) (slope) to minimize
\[\sum_{i = 1}^{n}{\left(y_{\text{obs}_i} - \left(\beta_0 + \beta_1\cdot x_{\text{obs}_i}\right)\right)^2}\]
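In the one-predictor case, minimizing this sum of squared errors has a well-known closed-form solution, which the sketch below computes directly. Here `x` and `y` are assumed to be 1-D NumPy arrays of the observed predictor and response values (placeholder names, not course-provided objects).

```python
import numpy as np

# Closed-form least-squares estimates for E[y] = b0 + b1*x
# (x and y are assumed to be 1-D NumPy arrays of observed values).
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
b0 = y_bar - b1 * x_bar                                            # intercept
sse = np.sum((y - (b0 + b1 * x)) ** 2)  # the minimized quantity above
```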

\(\displaystyle{\mathbb{E}\left[y\right] = 3783.21 - 45.3\cdot x}\)

Approach to Model Interpretation: In general, we’ll interpret the intercept (when appropriate) and the expected effect on the response of a unit change in each predictor.
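As a small numerical illustration of that interpretation, using the fitted equation above (the predictor value here is made up, not taken from the course data):

```python
# Interpretation sketch for E[y] = 3783.21 - 45.3*x:
# each one-unit increase in x is associated with an expected decrease of 45.3 in y,
# and the intercept 3783.21 is the expected response at x = 0 (when x = 0 is meaningful).
b0, b1 = 3783.21, -45.3
x_new = 10                   # an illustrative predictor value
y_hat = b0 + b1 * x_new
print(round(y_hat, 2))       # 3330.21
```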


Fits old and new observations similarly well
Equation: \(\displaystyle{\mathbb{E}\left[y\right] \approx 1202 - 4912x + 3156x^2}\)
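A quadratic model like this one can be fit with the same least-squares machinery. A minimal sketch follows, again assuming placeholder arrays `x` and `y`; the coefficients it returns will match the equation above only if fit to the same data that equation came from.

```python
import numpy as np

# Sketch: least-squares fit of the quadratic model E[y] = b0 + b1*x + b2*x^2.
# x and y are placeholder 1-D NumPy arrays of observed values.
b2, b1, b0 = np.polyfit(x, y, deg=2)   # coefficients returned highest power first
y_hat = b0 + b1 * x + b2 * x ** 2      # fitted values on the training data
```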


We don’t want to wait for new data to know we are wrong.

For the model and its predictions:
Training data are random and representative of the population.
Residuals (prediction errors) are normally distributed with mean \(\mu = 0\) and constant standard deviation \(\sigma\).
For interpretations of coefficients (statistical learning / inference), the same assumptions are required.
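A quick way to check these assumptions is to examine the residuals of a fitted model. The sketch below assumes a `statsmodels` fit named `fit`, like the one sketched earlier; the plots are only an informal eyeball check.

```python
import matplotlib.pyplot as plt

# Residual diagnostics for a fitted statsmodels model `fit` (assumed to exist).
residuals = fit.resid
print(residuals.mean())                   # should be close to 0

plt.scatter(fit.fittedvalues, residuals)  # look for roughly constant vertical spread
plt.axhline(0, color="gray")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()

plt.hist(residuals, bins=20)              # roughly bell-shaped if normality holds
plt.show()
```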
Building models \(\mathbb{E}\left[y\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\)
Predicting a numerical response (\(y\)) given features (\(x_i\))
Need data to build the models – some for training, some for validation
Model predictions will be wrong
As long as the standard deviation of the residuals (prediction errors) is constant, we can build meaningful confidence intervals for predictions (see the sketch after this list)
Can interpret models to gain insight into relationships between predictor(s) and response
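One way to obtain such intervals is shown below, using the `statsmodels` fit sketched earlier; the new predictor values and column names are illustrative assumptions, not course data.

```python
import pandas as pd

# Sketch: 95% intervals for predictions from the statsmodels fit sketched earlier.
new_data = pd.DataFrame({"x1": [5.0], "x2": [2.0]})  # illustrative predictor values
pred = fit.get_prediction(new_data)
print(pred.summary_frame(alpha=0.05))
# mean_ci_* columns: confidence interval for the expected response;
# obs_ci_*  columns: prediction interval for an individual new observation.
```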
Predict bicycle rental duration for a city bike share program
Response: duration
Predictors: You have access to nearly 40 explanatory variables which could be useful in predicting the duration of a rental.
Clarification: All work submitted on these competitions must reflect your own analytical thinking and modeling decisions.
Connection to Debrief Meetings: Your submitted competition work will serve as part of the foundation for our debrief meetings. In these meetings, you will be expected to explain and justify your modeling choices, performance results, and interpretations. The purpose of these discussions is to ensure that the submitted work accurately reflects your understanding.
Homework: Start Competition Assignment 1 – join the competition, read the details, download the data, and start writing a Statement of Purpose
Comment: Confirmatory versus Exploratory Workflows
There are two different settings from which we can approach statistics and data projects.
Exploratory Settings: Where we don’t yet have well-defined expectations or formal hypotheses about associations, patterns, or relationships in our data.
Confirmatory Settings: Where we have generated formal hypotheses or perhaps even officially pre-registered them prior to collecting any data.
The scenario we are in dictates the workflow we must use in order to conduct valid inference.