December 30, 2025
Often we must recognize that relationships are neither deterministic nor causal.
Noisy associations exist between one or more independent variables and a response.
In these cases, building an interpolant doesn’t make sense. Instead, we want to capture a general trend.


The interpolant is doing exactly what it's designed to do: pass through every observed data point exactly. But that forces unsupported variability (even when using a cubic spline interpolant), and we probably don't realistically expect new data to follow the interpolant. The simpler fitted model is more conservative, but likely more trustworthy.
With admittedly noisy data and noisy relationships, we want to build a model that captures a general trend between the available independent variables and our response.
That model should have a fairly simple form, otherwise we risk fitting the noise, which is unpredictable by definition.
Consider a function \(f\left(x\right) = f\left(x\mid \beta_0, \beta_1,\cdots,\beta_m\right)\) which has been fitted using \(n+1\) observed data points of the form \(\left(\vec{x}_i, y_i\right)\).
This function includes \(m+1\) parameters (\(\beta_0, \cdots, \beta_m\)), so \(m+1 < n+1\).
The observed points \(\left(\vec{x}_i, y_i\right)\) consist of measurements on independent variables which are contained in \(\vec{x}_i\) and a corresponding measured dependent response contained in \(y_i\).
Note that \(\vec{x}_i\) may consist of a single measured variable or many.
For example, if the data represent the displacements \(y_i\) of an overdamped mass-spring system at times \(t_i\), then the observations are of the form \(\left(t_i, y_i\right)\) and the form of the model suggested by theory is \(\displaystyle{f\left(t\right) = a_0te^{-a_1t}}\).
In general, parameters for a fitted model are obtained by minimizing a loss function.
If we are willing to assume that the noise is a feature of the response variable only (and the measurements on the independent variable(s) are to be trusted), then the most common loss function is the Sum of Squared Errors:
\[L\left(\beta_0, \beta_1,\cdots,\beta_m\right) = \sum_{i=0}^{n}{\left[y_i - f\left(x_i\right)\right]^2}\]
Models fit by minimizing the loss function above are said to be fit using Ordinary Least Squares (OLS).
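To make the loss concrete, here is a minimal sketch (the data and the `sse_loss()` helper are hypothetical, not from the course materials) that evaluates the SSE for two candidate straight-line models; the fitting procedures below choose the parameter values that make this quantity as small as possible.

```python
import numpy as np

# Hypothetical observed data (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sse_loss(beta0, beta1):
    # L(beta0, beta1) = sum_i [y_i - (beta0 + beta1*x_i)]^2
    residuals = y - (beta0 + beta1 * x)
    return np.sum(residuals**2)

print(sse_loss(0.0, 2.0))    # loss for one candidate parameter choice
print(sse_loss(0.5, 1.9))    # a different candidate; the smaller loss is the better fit
```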
The values of the \(\beta\)-parameters minimizing any Loss Function will satisfy the simultaneous system:
\[\left\{\frac{\partial L}{\partial \beta_j} = 0,~~\text{for}~~ j = 0, 1,\cdots, m\right.\]
Depending on the form of the model \(f\left(x\right)\) the equations in the system above may be non-linear and difficult to solve.
It is common to choose \(f\left(x\right)\) to be a linear combination of base functions \(f_i\left(x\right)\) so that
\[f\left(x\right) = \beta_0f_0\left(x\right) + \beta_1f_1\left(x\right) + \cdots + \beta_mf_m\left(x\right)\]
Doing this forces the simultaneous system to be linear in the \(\beta_i\) values.
As an example, if the fitted function is to be a polynomial, then we have \(f_0\left(x\right) = 1\), \(f_1\left(x\right) = x\), \(f_2\left(x\right) = x^2\), and so on, resulting in
\[f\left(x\right) = \beta_0 + \beta_1x + \beta_2x^2 + \cdots + \beta_mx^m\]
Because we seek to capture a general trend, we know that the fitted model will not pass through all of the observed data points.
We can define the standard error of the model residuals (prediction errors) to be
\[s_E = \sqrt{\left(\frac{L}{n - m}\right)}\]
where \(L\) denotes the loss of the fitted function over all the observed data.
If \(n+1 = m+1\) (if there is a parameter for every observation), then the model is an interpolant and \(s_E\) is undefined since it takes a “zero over zero” form.
A simple linear regression model is a model of the form \(f\left(x\right) = \beta_0 + \beta_1x\) which is fit to observed data of the form \(\left(x_i, y_i\right)\) by minimizing the sum of squared residuals.
In this case we can analyze our Loss function as follows:
\[\begin{align*} L\left(\beta_0, \beta_1\right) &= \sum_{i = 0}^{n}{\left[y_i - f\left(x_i\right)\right]^2}\\ &= \sum_{i = 0}^{n}{\left[y_i - \beta_0 - \beta_1x_i\right]^2} \end{align*}\]
We can minimize \(L\left(\beta_0, \beta_1\right)\) by solving the system:
\[\left\{\begin{array}{lcl} \frac{\partial L}{\partial \beta_0} & = & 0\\ \frac{\partial L}{\partial \beta_1} & = & 0\end{array}\right.\]
\[\implies \left\{\begin{array}{lcl} \sum{-2\left(y_i - \beta_0 - \beta_1x_i\right)} & = & 0\\ \sum{-2x_i\left(y_i - \beta_0 - \beta_1x_i\right)} & = & 0\end{array}\right.\] \[\implies \left\{\begin{array}{lcl} \sum{\left(y_i - \beta_0 - \beta_1x_i\right)} & = & 0\\ \sum{\left(x_iy_i - \beta_0x_i - \beta_1x_i^2\right)} & = & 0\end{array}\right.\] \[\implies \left\{\begin{array}{lcl} \sum{\left(\frac{y_i}{n+1} - \frac{\beta_0}{n+1} - \frac{\beta_1x_i}{n+1}\right)} & = & 0\\ \sum{\left(\frac{x_iy_i}{n+1} - \frac{\beta_0x_i}{n+1} - \frac{\beta_1x_i^2}{n+1}\right)} & = & 0\end{array}\right.\] \[\implies \left\{\begin{array}{lcl} \bar{y} - \beta_0 - \beta_1\bar{x} & = & 0\\ - \beta_0\bar{x} + \sum{\left(\frac{x_iy_i}{n+1} - \frac{\beta_1x_i^2}{n+1}\right)} & = & 0\end{array}\right.\]
The top equation in the final system indicates that \(\beta_0 = \bar{y} - \beta_1\bar{x}\).
We can substitute this into the bottom equation and use some algebra to arrive at \(\displaystyle{\beta_1 = \frac{\sum{y_i\left(x_i - \bar{x}\right)}}{\sum{x_i\left(x_i - \bar{x}\right)}}}\). (\(\bigstar\) – the algebra required is included at the end of this slide deck for those interested)
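As a quick numerical illustration (my own sketch, using made-up data), the closed-form estimates above can be computed directly with NumPy and checked against `np.polyfit()`:

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xbar = np.mean(x)
ybar = np.mean(y)

# beta1 = sum(y_i*(x_i - xbar)) / sum(x_i*(x_i - xbar)), then beta0 = ybar - beta1*xbar
beta1 = np.sum(y * (x - xbar)) / np.sum(x * (x - xbar))
beta0 = ybar - beta1 * xbar

print(beta0, beta1)
print(np.polyfit(x, y, 1))   # returns [slope, intercept]; should agree with the values above
```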
Consider the least-squares fit
\[\begin{align*}f\left(x\right) &= \beta_0f_0\left(x\right) + \beta_1f_1\left(x\right) + \beta_2f_2\left(x\right) + \cdots + \beta_mf_m\left(x\right)\\ &= \sum_{j=0}^{m}{\beta_j f_j\left(x\right)} \end{align*}\]
Substituting this into our least squares loss function gives
\[L\left(\beta_0, \beta_1, \cdots, \beta_m\right) = \sum_{i=0}^{n}\left[y_i - \sum_{j=0}^{m}{\beta_j f_j\left(x_i\right)}\right]^2\]
This loss is minimized by the solution to the following linear system:
\[\left\{\frac{\partial L}{\partial \beta_k} = -2\left(\sum_{i=0}^{n}{\left(\left(y_i - \sum_{j = 0}^{m}{\left(\beta_j f_j\left(x_i\right)\right)}\right)f_k\left(x_i\right)\right)}\right) = 0\right.~~\text{for}~~k = 0, 1, \cdots, m\]
We can divide both sides of each equation in the system by \(-2\) and rearrange the summations.
\[\begin{align*}\left\{\left(\sum_{i=0}^{n}{\left(\left(y_i - \sum_{j = 0}^{m}{\left(\beta_j f_j\left(x_i\right)\right)}\right)f_k\left(x_i\right)\right)}\right)\right. &= 0\\ \implies \left\{\sum_{i=0}^{n}{\left(y_if_k\left(x_i\right) - \sum_{j = 0}^{m}{\left(\beta_j f_j\left(x_i\right)f_k\left(x_i\right)\right)}\right)}\right. &= 0\\ \implies \left\{\sum_{i=0}^{n}{\left(y_if_k\left(x_i\right)\right)} - \sum_{i=0}^{n}{\left(\sum_{j = 0}^{m}{\left(\beta_j f_j\left(x_i\right)f_k\left(x_i\right)\right)}\right)}\right. &= 0\\ \implies \left\{\sum_{i=0}^{n}{\left(\sum_{j = 0}^{m}{\left(\beta_j f_j\left(x_i\right)f_k\left(x_i\right)\right)}\right)}\right. &= \sum_{i=0}^{n}{\left(y_if_k\left(x_i\right)\right)} \end{align*}\]
\[\begin{align*} \implies \left\{\sum_{j=0}^{m}{\left(\sum_{i = 0}^{n}{\left(\beta_j f_j\left(x_i\right)f_k\left(x_i\right)\right)}\right)}\right. &= \sum_{i=0}^{n}{\left(y_if_k\left(x_i\right)\right)}\\ \implies \left\{\sum_{j=0}^{m}{\left(\sum_{i = 0}^{n}{\left(f_j\left(x_i\right)f_k\left(x_i\right)\right)\beta_j}\right)}\right. &= \sum_{i=0}^{n}{\left(f_k\left(x_i\right)y_i\right)}~~\text{for}~~k = 0, 1, \cdots, m\\ \end{align*}\]
We can rewrite the above using matrix notation as \(A\vec{\beta} = \vec{b}\), where
\[A_{kj} = \sum_{i = 0}^{n}{f_j\left(x_i\right)f_k\left(x_i\right)}~~~~\text{and}~~~~b_k = \sum_{i=0}^{n}{f_k\left(x_i\right)y_i}\]
These equations are known as the normal equations of the least-squares fit, and can be solved using our numerical methods for solving [symmetric] linear systems!
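As a sketch of how the normal equations can be assembled in code (this is my own illustration: it assumes hypothetical data, a list of basis-function callables, and uses `np.linalg.solve()` rather than the course's `gaussPivot()` routine), the formulas for \(A\) and \(\vec{b}\) translate almost directly:

```python
import numpy as np

def normal_equations_fit(xData, yData, basis):
    # basis is a list of callables f_0, f_1, ..., f_m
    F = np.array([[f(x) for f in basis] for x in xData])  # F[i, j] = f_j(x_i)
    A = F.T @ F                    # A[k, j] = sum_i f_j(x_i) * f_k(x_i)
    b = F.T @ np.asarray(yData)    # b[k]    = sum_i f_k(x_i) * y_i
    return np.linalg.solve(A, b)   # any solver for symmetric linear systems works here

# Example: a quadratic fit using the basis 1, x, x^2 on hypothetical data
xData = [0.0, 1.0, 2.0, 3.0, 4.0]
yData = [1.1, 2.0, 4.8, 9.6, 17.2]
basis = [lambda x: 1.0, lambda x: x, lambda x: x**2]
print(normal_equations_fit(xData, yData, basis))
```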
As mentioned earlier, a commonly used linear form is a polynomial.
In this case, the basis functions are \(f_j\left(x\right) = x^j\).
This leads to the following \(A\) and \(\vec{b}\) from the normal equations above:
\[A_{kj} = \sum_{i = 0}^{n}{x_i^{j+k}}~~~~\text{and}~~~~b_k = \sum_{i=0}^{n}{x_i^ky_i}\]
The coefficient matrix \(A\) becomes increasingly ill-conditioned as \(m\) is made larger.
Luckily, high-degree polynomials are typically not used in curve-fitting since they are very susceptible to fitting the noise in the observed data.
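To see the ill-conditioning directly, here is a small sketch (with arbitrarily chosen \(x\) values; the exact numbers depend on the data) that builds the polynomial normal-equations matrix \(A\) for increasing degrees and prints its condition number, which grows rapidly with \(m\):

```python
import numpy as np

x = np.linspace(0.0, 10.0, 21)   # arbitrary observation locations

for m in [2, 4, 6, 8, 10]:
    # A[k, j] = sum_i x_i^(j + k), the normal equations for the monomial basis
    A = np.array([[np.sum(x**(j + k)) for j in range(m + 1)] for k in range(m + 1)])
    print(m, np.linalg.cond(A))
```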
With all of that math understood, we are ready to construct a routine to fit least-squares polynomial models.
Since we need to solve symmetric linear systems as part of solving the normal equations, we’ll need access to functionality from earlier in our course.
As a reminder, you make this older functionality available in a new notebook by pasting it into a code cell at the top of your notebook and running that code cell.
Because observed data can come in any order, we should be worried about when to pivot as we solve our linear system. For this reason, we’ll bring in the swapRows(), scaleFactors(), and gaussPivot() functions from our Day 7 discussion.
I’ve already enabled them in this slide deck, for convenience. I won’t show the pasting here.
With those earlier functions available to us, we’ll write our polyFit() routine.
This routine will return the coefficients of a linear model of the form
\[f\left(x\right) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_m x^m\]
fit on training data of the form \(\left(x_i, y_i\right)\), with `xData` holding the \(x_i\) values and `yData` holding the \(y_i\) values.
The function will take one additional parameter, m, which is the degree of the polynomial model.
```python
# numpy and matplotlib are needed by the routines below
import numpy as np
import matplotlib.pyplot as plt

def polyFit(xData, yData, m):
    # Build the normal equations A*beta = b for a degree-m polynomial model,
    # then solve them with the gaussPivot() routine brought in from Day 7
    A = np.zeros((m+1, m+1))
    b = np.zeros(m+1)
    s = np.zeros(2*m+1)
    for i in range(len(xData)):
        # accumulate b[j] = sum_i y_i * x_i^j
        temp = yData[i]
        for j in range(m+1):
            b[j] = b[j] + temp
            temp = temp*xData[i]
        # accumulate the power sums s[j] = sum_i x_i^j for j = 0, ..., 2m
        temp = 1.0
        for j in range(2*m + 1):
            s[j] = s[j] + temp
            temp = temp*xData[i]
    # A[i, j] = s[i + j], matching the normal equations for the monomial basis
    for i in range(m+1):
        for j in range(m+1):
            A[i, j] = s[i + j]
    return gaussPivot(A, b)
```
```python
def evalPoly(coefs, x):
    # Evaluate f(x) = coefs[0] + coefs[1]*x + ... + coefs[m]*x^m via Horner's scheme
    m = len(coefs) - 1
    p = coefs[m]
    for j in range(m):
        p = p*x + coefs[m - j - 1]
    return p
```
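As a quick sanity check of the evaluation routine (this example is mine, not part of the original notes), `evalPoly()` applies Horner's scheme, so evaluating the coefficients of \(1 + 2x + 3x^2\) at \(x = 2\) should return 17:

```python
coefs = [1.0, 2.0, 3.0]        # represents f(x) = 1 + 2x + 3x^2
print(evalPoly(coefs, 2.0))    # 1 + 2*2 + 3*4 = 17.0
```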
```python
def getStdError(xData, yData, coefs):
    # Standard error of the residuals: sqrt(SSE / (n - m)), where here
    # n is the number of observations and m is the number of fitted parameters
    n = len(xData)
    m = len(coefs)
    sse = 0.0
    for i in range(n):
        y_hat = evalPoly(coefs, xData[i])
        resid = yData[i] - y_hat
        sse = sse + resid**2
    stdError = (sse/(n - m))**0.5
    return stdError
```
```python
def plotPoly(xData, yData, coefs, num_pts = 100, xlab = "x", ylab = "y"):
    # Plot the observed data as points along with the fitted polynomial curve
    m = len(coefs)
    x1 = min(xData)
    x2 = max(xData)
    x_new = np.linspace(x1, x2, num_pts)
    y_new = np.zeros(len(x_new))
    for i in range(num_pts):
        y_new[i] = evalPoly(coefs, x_new[i])
    plt.scatter(xData, yData, color = "black", s = 150)
    plt.plot(x_new, y_new, color = "purple", linewidth = 3)
    plt.grid()
    plt.axhline(color = "black")
    plt.axvline(color = "black")
    plt.xlabel(xlab)
    plt.ylabel(ylab)
    plt.title("A Polynomial Least Squares Fit of Degree " + str(m - 1), fontsize = 20)
    plt.show()
```

Example: Fit a second-order (quadratic) linear regression model to the observed data below.
| \(x\) | \(y\) |
|---|---|
| 4 | -4.35 |
| 10 | 29.15 |
| 0 | 31.64 |
| 9 | 37.23 |
| 6 | 12.27 |
| 5 | 8.20 |
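One way the example might be worked with the routines above (the `np.array` setup is mine; the calls mirror the functions defined earlier, and the printed values will come from your own run):

```python
import numpy as np

xData = np.array([4.0, 10.0, 0.0, 9.0, 6.0, 5.0])
yData = np.array([-4.35, 29.15, 31.64, 37.23, 12.27, 8.20])

coefs = polyFit(xData, yData, 2)            # fit f(x) = b0 + b1*x + b2*x^2
print(coefs)                                # fitted coefficients
print(getStdError(xData, yData, coefs))     # standard error of the residuals
plotPoly(xData, yData, coefs)               # scatter of the data plus the fitted curve
```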
We’ll pause here to encounter our first Unit Problem Set. This means that you’ll spend the next week working through several interesting applications which can be addressed using one or more of the methods we’ve encountered so far in our course.
Earlier in this notebook it was suggested that, after a bit of algebra, we could obtain \(\displaystyle{\beta_1 = \frac{\sum{y_i\left(x_i - \bar{x}\right)}}{\sum{x_i\left(x_i - \bar{x}\right)}}}\). The algebra required for that appears below. Recall that we have \(\beta_0 = \bar{y} - \beta_1\bar{x}\). From here we have
\[\begin{align*} -\beta_0\bar{x} + \sum{\left(\frac{x_iy_i}{n+1} - \frac{\beta_1x_i^2}{n+1}\right)} &= 0\\ \implies -\left(\bar{y} - \beta_1 \bar{x}\right)\bar{x} + \sum{\left(\frac{x_iy_i}{n+1} - \frac{\beta_1x_i^2}{n+1}\right)} &= 0\\ \implies -\bar{x}\bar{y} + \beta_1\bar{x}^2 + \sum{\frac{x_iy_i}{n+1}} - \beta_1\sum{\frac{x_i^2}{n+1}} &= 0\\ \implies \beta_1\sum{\frac{x_i^2}{n+1}} - \beta_1\bar{x}^2 &= \sum{\frac{x_iy_i}{n+1}} - \bar{x}\bar{y} \\ \implies \beta_1\left(\sum{\frac{x_i^2}{n+1}} - \left(\sum{\frac{x_i}{n+1}}\right)^2\right) &= \sum{\frac{x_iy_i}{n+1}} - \left(\sum{\frac{x_i}{n+1}}\right)\left(\sum{\frac{y_i}{n+1}}\right)\\ \implies \beta_1\left(\sum{\frac{x_i^2}{n+1}} - \left(\sum{\frac{x_i}{n+1}}\right)\left(\sum{\frac{x_i}{n+1}}\right)\right) &= \sum{\frac{x_iy_i}{n+1}} - \left(\sum{\frac{x_i}{n+1}}\right)\left(\sum{\frac{y_i}{n+1}}\right)\\ \implies \beta_1\left(\sum{\left(\frac{x_i}{n+1}\left(x_i - \sum{\frac{x_i}{n+1}}\right)\right)}\right) &= \sum{\left(\frac{y_i}{n+1}\left(x_i - \sum{\frac{x_i}{n+1}}\right)\right)}\\ \implies \beta_1\sum{\left(\frac{x_i}{n+1}\left(x_i - \bar{x}\right)\right)} &= \sum{\left(\frac{y_i}{n+1}\left(x_i - \bar{x}\right)\right)}\\ \implies \beta_1\sum{x_i\left(x_i - \bar{x}\right)} &= \sum{y_i\left(x_i - \bar{x}\right)}\\ \implies \beta_1 &= \frac{\sum{y_i\left(x_i - \bar{x}\right)}}{\sum{x_i\left(x_i - \bar{x}\right)}}~~\checkmark \end{align*}\]
Comments on the Functionality
The previous slide gave you four new functions that are useful for least squares regression (curve fitting) applications. You’ll directly use all of them.
- The `polyFit()` function takes in the observed data and the desired degree of the polynomial model and returns the fitted coefficients as solutions to the normal equations.
- The `evalPoly()` function allows you to evaluate your fitted polynomial for new observations. That is, this is the function that allows you to make predictions.
- The `getStdError()` function computes the standard error of the residuals. This is a measure of model performance; smaller indicates a better fit, but you should be guided by held-out test data here rather than reusing your training data. Take MAT300 to find out why.
- The `plotPoly()` function is a convenience function that will handle plotting your training data and fitted polynomial model for you.