
2.1. Overview

Introduction

The time has finally come: let’s apply what we’ve learned about loss functions and the modeling recipe to “upgrade” from the constant model to the simple linear regression model.

To recap, our goal is to find a hypothesis function $h$ such that:

$$\text{predicted commute time}_i = h(\text{departure hour}_i)$$
Image produced in Jupyter

So far, we’ve studied the constant model, where the hypothesis function is a horizontal line:

$$h(x_i) = w$$

The sole parameter, $w$, controlled the height of the line. Up until now, “parameter” and “prediction” were interchangeable terms, because $w$ controlled what our constant prediction was.

Now, the simple linear regression model has two parameters:

$$h(x_i) = w_0 + w_1 x_i$$

$w_0$ controls the intercept of the line, and $w_1$ controls its slope. No longer is it the case that “parameter” and “prediction” are interchangeable terms, because $w_0$ and $w_1$ control different aspects of the prediction-making process.
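
To make the two parameters concrete, here’s a minimal Python sketch of this hypothesis function. The parameter values and the departure hour below are made up for illustration; they aren’t taken from the commute times dataset.

```python
# A minimal sketch of the simple linear regression hypothesis function.
def h(x_i, w0, w1):
    """Predict a commute time (in minutes) from a departure hour."""
    return w0 + w1 * x_i

# Two different parameter choices give two different lines, and therefore
# two different predictions for the same departure hour (8.5, i.e. 8:30 AM).
print(h(8.5, 100, -5))   # 100 + (-5) * 8.5 = 57.5
print(h(8.5, 120, -8))   # 120 + (-8) * 8.5 = 52.0
```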

How do we find the optimal parameters, $w_0^*$ and $w_1^*$? Different values of $w_0$ and $w_1$ give us different lines, each of which fits the data with varying degrees of accuracy.

Image produced in Jupyter

To make things precise, let’s turn to the three-step modeling recipe from Chapter 1.3.

1. Choose a model.

$$h(x_i) = w_0 + w_1 x_i$$

2. Choose a loss function.

We’ll stick with squared loss:

$$L_\text{sq}(y_i, h(x_i)) = (y_i - h(x_i))^2$$

3. Minimize average loss (also known as empirical risk) to find optimal parameters.

Average squared loss – also known as mean squared error – for any hypothesis function $h$ takes the form:

$$\frac{1}{n} \sum_{i=1}^n (y_i - h(x_i))^2$$

For the simple linear regression model, this becomes:

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$
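
As a sketch of what this computation looks like in code – assuming NumPy arrays `x` and `y` of departure hours and commute times, with made-up values standing in for the real dataset:

```python
import numpy as np

def mean_squared_error(w0, w1, x, y):
    """R_sq(w0, w1): average squared loss of the line w0 + w1 * x on the data."""
    predictions = w0 + w1 * x
    return np.mean((y - predictions) ** 2)

# Hypothetical departure hours and commute times (in minutes).
x = np.array([7.0, 8.0, 9.0, 10.0])
y = np.array([90.0, 80.0, 65.0, 60.0])

# Different parameter choices produce different average losses.
print(mean_squared_error(100, -5, x, y))
print(mean_squared_error(120, -8, x, y))
```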

Now, we need to find the values of $w_0$ and $w_1$ that together minimize $R_\text{sq}(w_0, w_1)$. But what does that even mean?

In the case of the constant model and squared loss, where we had to minimize $R_\text{sq}(w) = \frac{1}{n} \sum_{i=1}^n (y_i - w)^2$, we did so by taking the derivative with respect to $w$ and setting it to 0.

Image produced in Jupyter

$R_\text{sq}(w)$ was a function with just a single input variable ($w$), so the problem of minimizing $R_\text{sq}(w)$ was straightforward, and resembled problems we solved in Calculus 1.
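
If you’d like a quick sanity check of that single-variable result without redoing the calculus, here’s a sketch that sweeps over many candidate values of $w$ and confirms numerically that the minimizer lands at the mean of the $y$-values. The data below is made up for illustration.

```python
import numpy as np

# Hypothetical commute times (in minutes).
y = np.array([90.0, 80.0, 65.0, 60.0])

# Evaluate R_sq(w) on a fine grid of candidate constant predictions.
candidate_ws = np.linspace(50, 100, 10001)
risks = np.array([np.mean((y - w) ** 2) for w in candidate_ws])

# The minimizing w matches the mean of the y-values (73.75 here).
best_w = candidate_ws[np.argmin(risks)]
print(best_w, y.mean())
```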

The function $R_\text{sq}(w_0, w_1)$ we’re minimizing now has two input variables, $w_0$ and $w_1$. In mathematics, sometimes we’ll write $R_\text{sq}: \mathbb{R}^2 \to \mathbb{R}$ to say that $R_\text{sq}$ is a function that takes in two real numbers and returns a single real number.

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

Remember, we should treat the $x_i$’s and $y_i$’s as constants, as these are known quantities once we’re given a dataset.

What does $R_\text{sq}(w_0, w_1)$ even look like? We need three dimensions to visualize it – one axis for $w_0$, one for $w_1$, and one for the output, $R_\text{sq}(w_0, w_1)$.

[Interactive 3D plot of the loss surface $R_\text{sq}(w_0, w_1)$]

The graph above is called a loss surface, even though it’s a graph of empirical risk, i.e. average loss, not the loss for a single data point. The plot is interactive, so you should drag it around to get a sense of what it looks like. It looks like a parabola with added depth, similar to how cubes look like squares with added depth. Lighter regions above correspond to low mean squared error, and darker regions correspond to high mean squared error.

Think of the “floor” of the graph – in other words, the $w_0$-$w_1$ plane – as the set of all possible combinations of intercept and slope. The height of the surface at any point $(w_0, w_1)$ is the mean squared error of the hypothesis $h(x_i) = w_0 + w_1 x_i$ on the commute times dataset.
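
For a sense of how a surface like this can be generated, here’s a sketch that evaluates mean squared error on a grid of $(w_0, w_1)$ pairs and plots the heights with Matplotlib. The dataset and grid ranges are made up, and this isn’t the exact code used to produce the interactive figure above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical departure hours and commute times (in minutes).
x = np.array([7.0, 8.0, 9.0, 10.0])
y = np.array([90.0, 80.0, 65.0, 60.0])

# Grid of candidate intercepts and slopes.
w0_values = np.linspace(0, 200, 100)
w1_values = np.linspace(-20, 10, 100)
W0, W1 = np.meshgrid(w0_values, w1_values)

# For every (w0, w1) pair, average the squared errors over the dataset.
R = np.mean((y - (W0[..., np.newaxis] + W1[..., np.newaxis] * x)) ** 2, axis=-1)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(W0, W1, R, cmap="viridis")
ax.set_xlabel("$w_0$ (intercept)")
ax.set_ylabel("$w_1$ (slope)")
ax.set_zlabel("mean squared error")
plt.show()
```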

Our goal is to find the combination of $w_0$ and $w_1$ that gets us to the bottom of the surface, marked by the gold point in the plot. This will involve calculus and derivatives, but we’ll need to extend our single-variable approach: we’ll need to take partial derivatives with respect to $w_0$ and $w_1$. Chapter 2.2 is a detour that describes how these work; in Chapter 2.3, we’ll use them to find the optimal parameters.


A Preview

Just so you have them, though, here’s what we’ll end up finding:

$$\boxed{w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad w_0^* = \bar{y} - w_1^* \bar{x}}$$

These are formulas that describe the optimal slope, $w_1^*$, and intercept, $w_0^*$, for the simple linear regression model, given a dataset $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. They are chosen to minimize mean squared error. On our commute times dataset, the resulting line looks like this:

Image produced in Jupyter
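
If you want to try these formulas out ahead of the derivation, here’s a minimal NumPy sketch of them. The function name `fit_simple_linear_regression` and the small dataset are made up for illustration; they aren’t part of the commute times analysis.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return (w0*, w1*) minimizing mean squared error, via the boxed formulas."""
    x_bar, y_bar = x.mean(), y.mean()
    w1_star = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0_star = y_bar - w1_star * x_bar
    return w0_star, w1_star

# Hypothetical departure hours and commute times (in minutes).
x = np.array([7.0, 8.0, 9.0, 10.0])
y = np.array([90.0, 80.0, 65.0, 60.0])

w0_star, w1_star = fit_simple_linear_regression(x, y)
print(w0_star, w1_star)   # 163.0, -10.5 for this made-up data
```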