
1.5. Least Squares

The Modeling Recipe

We’ve now made several passes over the three-step modeling recipe, first introduced in Chapter 1.3.

  1. Choose a model.

  2. Choose a loss function.

  3. Minimize average loss to find optimal parameters.

Our very first pass through the recipe involved:

  1. The constant model, $h(x_i) = w$.

  2. Squared loss, $L_\text{sq}(y_i, h(x_i)) = (y_i - h(x_i))^2$.

  3. The above two choices gave us average loss, or empirical risk, of

     $$R_\text{sq}(w) = \frac{1}{n} \sum_{i=1}^n (y_i - w)^2$$

     which we showed is minimized by $w^* = \bar{y}$.

Then, in Chapter 1.4, we considered:

  1. The simple linear regression model, $h(x_i) = w_0 + w_1 x_i$.

  2. Squared loss, $L_\text{sq}(y_i, h(x_i)) = (y_i - h(x_i))^2$.

  3. The above two choices gave us average loss, or empirical risk, of

     $$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

     which we showed – using partial derivatives – is minimized by

$$w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = r \frac{\sigma_y}{\sigma_x}, \qquad w_0^* = \bar{y} - w_1^* \bar{x}$$
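
These formulas translate directly into a few lines of numpy. Below is a minimal sketch, assuming x and y are numpy arrays of departure hours and commute times (the same arrays used in the code later in this section); the helper name fit_simple_linear is mine, just for illustration.

import numpy as np

# A sketch of the closed-form solution above.
# Assumes x and y are numpy arrays of departure hours and commute times.
def fit_simple_linear(x, y):
    w1 = np.sum((x - np.mean(x)) * (y - np.mean(y))) / np.sum((x - np.mean(x)) ** 2)
    w0 = np.mean(y) - w1 * np.mean(x)
    return w0, w1

# Equivalently, the slope is r * (std of y) / (std of x):
# np.corrcoef(x, y)[0, 1] * np.std(y) / np.std(x)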

Let’s visualize both the best constant model and the best simple linear regression model on the commute times dataset.

[Figure: the best constant model and the best simple linear regression model, plotted on the commute times dataset.]

A fact we mentioned earlier – and you restated in Homework 1 – is that the mean squared error of the best constant model is the variance of the data. We can see this by substituting $w^* = \bar{y}$ into mean squared error:

$$R_\text{sq}(w) = \frac{1}{n} \sum_{i=1}^n (y_i - w)^2 \implies R_\text{sq}(\bar{y}) = \underbrace{\frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2}_{\text{variance of } y} = \sigma_y^2$$

This gives us a nice baseline against which to compare the performance of more sophisticated models. If a model we create has a mean squared error that’s higher than the variance of the data, we know we’ve done something wrong, because a constant model would be better!

The mean squared error of the best simple linear regression model on the data is:

$$R_\text{sq}(w_0^*, w_1^*) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0^* + w_1^* x_i))^2$$

This formula alone isn’t particularly enlightening, but it turns out this is equivalent to $\sigma_y^2 (1 - r^2)$, where $r$ is the correlation coefficient between $x$ and $y$. (The proof of this is an optional component of Homework 2.)

Since $0 \leq r^2 \leq 1$, we have $\sigma_y^2(1 - r^2) \leq \sigma_y^2$, which means the best simple linear regression model can’t perform worse than the best constant model. If $r = 0$, then the best simple linear regression model has a slope of 0, and is just the best constant model! Otherwise, the best simple linear regression model will have a non-zero slope, which is chosen to minimize mean squared error.

In this particular dataset:

$$\text{MSE of best constant model} = \sigma_y^2 = 167.54 \text{ minutes}^2$$
$$\text{MSE of best simple linear regression model} = \sigma_y^2 \underbrace{(1 - r^2)}_{\text{here, } r = -0.65} = 97.05 \text{ minutes}^2$$
import numpy as np

# Assume x and y are arrays containing
# departure hours and commute times, respectively.
np.var(y, ddof=0)                                        # 167.535147928994
np.corrcoef(x, y)[0, 1]                                  # -0.6486426165832004
np.var(y, ddof=0) * (1 - np.corrcoef(x, y)[0, 1] ** 2)   # 97.04687150819171

Least Squares

Least squares is the general term for minimizing mean squared error – or equivalently, the sum of squared errors – to find optimal model parameters. Least squares appears quite frequently outside of machine learning contexts, too.

For example, population ecologists use least squares to model the growth of populations (say, of an animal species) over time.

[Figure: population size over time, with an exponential curve (orange) fit to the data.]

The orange curve above is of the form $h(x_i) = ae^{bx_i}$, where $a$ and $b$ are parameters. I’ve chosen the form $ae^{bx_i}$ because it’s a good fit for the data; it’s how we model exponential growth.

$a$ and $b$ could be chosen by minimizing mean squared error directly:

$$R_\text{sq}(a, b) = \frac{1}{n} \sum_{i=1}^n (y_i - ae^{bx_i})^2$$

If we wanted to minimize $R_\text{sq}(a, b)$, we’d need to take partial derivatives with respect to $a$ and $b$, set them to 0, and solve the resulting system of equations. However, the function $h(x_i) = ae^{bx_i}$ involves a parameter ($b$) in the exponent, which means the resulting partial derivatives of $R_\text{sq}$ will be complicated and the resulting system of equations will be challenging to solve. (Don’t believe me? Try finding $\frac{\partial R_\text{sq}}{\partial b}$.)
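
If you do work it out, you should end up with something like:

$$\frac{\partial R_\text{sq}}{\partial b} = -\frac{2}{n} \sum_{i=1}^n a x_i e^{bx_i} \left(y_i - ae^{bx_i}\right)$$

Setting this to 0 leaves $a$ and $b$ tangled up inside and outside the exponentials, so there’s no clean way to isolate either parameter algebraically.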

In this particular case, there’s a transformation we can apply to the data so that we can reduce the minimization problem to something we’ve seen before! Consider what happens if we take the logarithm of both sides of the equation $y_i \approx ae^{bx_i}$:

$$y_i \approx ae^{bx_i} \implies \log(y_i) \approx \log(a) + bx_i$$

I’ve written the $\approx$ symbol to remind us that $y_i$ is not exactly equal to $ae^{bx_i}$, but rather, the goal is to choose $a$ and $b$ so that $ae^{bx_i}$ is approximately equal to $y_i$, across all points $i$.

When we take the logarithm of each $y$-value in our dataset, the resulting plot looks like:

[Figure: the dataset with log-transformed $y$-values, which looks roughly linear.]

The transformed data looks roughly linear!

How does this transformation help us find $a$ and $b$? Now, instead of minimizing:

$$R_\text{sq}(a, b) = \frac{1}{n} \sum_{i=1}^n (y_i - ae^{bx_i})^2$$

I’m claiming that since $\log(y_i) \approx \log(a) + bx_i$, we can instead minimize:

$$R'_\text{sq}(a, b) = \frac{1}{n} \sum_{i=1}^n \left(\log(y_i) - (\underbrace{\log(a)}_{w_0} + \underbrace{b}_{w_1}x_i)\right)^2$$

$R'_\text{sq}$ is not the derivative of $R_\text{sq}$; it is a different empirical risk function, with different optimal parameters.

That said, $R'_\text{sq}$ should look familiar: it looks remarkably similar to the mean squared error of the simple linear regression model.

Let’s define $z_i = \log(y_i)$, $w_0 = \log(a)$, and $w_1 = b$. Then, we can rewrite $R'_\text{sq}$ as:

$$R'_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (z_i - (w_0 + w_1 x_i))^2$$

We just spent all of Chapter 1.4 minimizing an empirical risk function of exactly this form, so the optimal values $w_0^*$ and $w_1^*$ are well known to us!

$$w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})(z_i - \bar{z})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad w_0^* = \bar{z} - w_1^* \bar{x}$$

Note that I’ve computed $w_0^* = 1.24$ and $w_1^* = 1.92$ using the transformed data, $(x_1, z_1), \ldots, (x_n, z_n)$. On the transformed data, this line looks like:

[Figure: the line with intercept $w_0^* = 1.24$ and slope $w_1^* = 1.92$, plotted on the transformed data.]

How do we turn this back into a curve on the original data? To transform the data, we applied the following substitutions, which we now need to invert:

$$w_0 = \log(a) \implies a = e^{w_0}, \qquad b = w_1$$

So, using the above, we can find:

$$a^* = e^{w_0^*} = e^{1.24} = 3.46, \qquad b^* = w_1^* = 1.92$$

So, the curve we’re looking for is:

$$h(x_i) = 3.46 e^{1.92x_i}$$

And this is the equation of the orange curve I first showed you!

[Figure: the curve $h(x_i) = 3.46e^{1.92x_i}$, plotted on the original population data.]

Not every least squares problem can be reduced to the simple linear regression problem we’ve been studying. But, in the rare cases that it can, it’s satisfying to be able to reuse our work.
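
To make the procedure above concrete, here’s a minimal numpy sketch of the transform-fit-invert recipe, assuming x and y are numpy arrays holding the original (untransformed) population data, with all $y$-values positive so the logarithm is defined.

import numpy as np

# Fit h(x) = a * e^(b * x) by linearizing: log(y) ≈ log(a) + b * x.
# Assumes x and y are numpy arrays of the original data, with y > 0.
z = np.log(y)

# Simple linear regression of z on x.
w1 = np.sum((x - np.mean(x)) * (z - np.mean(z))) / np.sum((x - np.mean(x)) ** 2)
w0 = np.mean(z) - w1 * np.mean(x)

# Invert the substitutions w0 = log(a), w1 = b.
a = np.exp(w0)
b = w1

predictions = a * np.exp(b * x)

Keep in mind that this minimizes $R'_\text{sq}$, not $R_\text{sq}$ itself, so the resulting $a$ and $b$ may differ slightly from what a direct minimization of $R_\text{sq}$ would produce.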


What’s Next?

Following Chapter 1.4, the next natural question to ask is, how do we use multiple features (input variables) to make predictions?

When we first introduced the commute times example back in Chapter 1.2, we saw that the dataset we’re working with has a few features beyond the departure hour.

[Table: a preview of the commute times dataset, which contains more columns than just departure_hour and minutes, including day and date.]

Up until now, we’ve been using the departure_hour feature to predict the minutes column.

It would be great if we could figure out a way to use the day of the week as a feature, since our intuition tells us this might have an effect on commute times (traffic is known to be bad on Monday mornings and Friday afternoons). But, the day column currently contains strings, not numbers, so we’ll have to think critically about how to encode this data numerically. (You’ll see how to do this in a later homework.)

For now, let’s extract the day of the month from the date column, and try and use day_of_month along with departure_hour as our two features.

[Table: a preview of the dataset with the departure_hour, day_of_month, and minutes columns.]

We want to use the first two columns to predict the third. What would the hypothesis function for such a model look like? Previously, we had:

$$\underbrace{h(\text{departure hour}_i)}_{\text{predicted commute time } i} = w_0 + w_1 \cdot \text{departure hour}_i$$

Now, a linear model that involves both features would look like:

$$h(\text{departure hour}_i, \text{dom}_i) = w_0 + w_1 \cdot \text{departure hour}_i + w_2 \cdot \text{dom}_i$$

This is called a multiple linear regression model, since it uses multiple features. (To keep the equation short, I’ve used $\text{dom}$ to refer to the day of the month.)

The first hypothesis function, with one input variable, looked like a line in two dimensions. Our new multiple linear regression model looks like a plane in three dimensions!

[Figure: the multiple linear regression model as a plane in three dimensions, over departure hour and day of month.]

In two-dimensional space, a line has a constant rate of change, described by its slope.

In three-dimensional space, a plane has a constant rate of change along both the $x$ and $y$ axes. The plane’s rate of change in the $x$ (departure hour) direction is given by $w_1^*$, and its rate of change in the $y$ (day of month) direction is given by $w_2^*$.

Finding the optimal parameters $w_0^*$, $w_1^*$, and $w_2^*$ involves minimizing, yet again, mean squared error:

$$R_\text{sq}(w_0, w_1, w_2) = \frac{1}{n} \sum_{i=1}^n \left(y_i - (w_0 + w_1 \cdot \text{departure hour}_i + w_2 \cdot \text{dom}_i) \right)^2$$

Mean squared error here is a function of three variables, so we’d need four dimensions to visualize it, which, of course, we cannot do. To minimize $R_\text{sq}$ above, we’d need to find three partial derivatives, set them all to 0, and solve the resulting system of 3 equations and 3 unknowns.

While we could do this, eventually, we’ll want to consider a third feature, a fourth, and so on. In general, if we want $d$ features, we’ll have $d+1$ parameters to optimize (one per feature, plus one for the intercept, $w_0$), and we’ll need to solve a system of $d+1$ equations and $d+1$ unknowns to find them. Continuing this approach sounds painful.

This is where linear algebra comes in. Linear algebra provides a compact, efficient way to represent and manipulate systems of linear equations, and to find solutions to them.
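
As a preview of where we’re headed, numpy can already solve this kind of problem for us. Below is a hedged sketch using np.linalg.lstsq, assuming departure_hour, day_of_month, and minutes are numpy arrays holding the three columns of our dataset; we’ll build up the tools to understand what this routine is actually doing over the next two chapters.

import numpy as np

# A sketch of fitting the multiple linear regression model, assuming
# departure_hour, day_of_month, and minutes are 1D numpy arrays.
X = np.column_stack([
    np.ones_like(departure_hour, dtype=float),  # column of 1s for the intercept, w0
    departure_hour,                             # feature for w1
    day_of_month,                               # feature for w2
])

# np.linalg.lstsq returns the parameters minimizing the sum of squared errors.
w, *_ = np.linalg.lstsq(X, minutes, rcond=None)
w0, w1, w2 = w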

Chapter 2 is dedicated to introducing the ideas from linear algebra that we’ll need to solve this problem: the problem of finding the parameters that minimize mean squared error for the multiple linear regression model. It will take some time to get there, and at times, it’ll seem like what we’re doing is unrelated to this goal of finding optimal parameters.

But, I promise that we will close the loop in Chapter 3, when we return to multiple linear regression.