
3.1. Regression using Linear Algebra

At times, Chapter 2 may have seemed like a bit of a detour from the main storyline of the course. It’s now time to apply our understanding of vectors and matrices to the problem of linear regression.


Recap: The Modeling Recipe

In Chapter 1.2, we introduced the problem of predicting the length of our commute to school ($y_i$) as a function of the time we leave home ($x_i$).

[Figure: scatter plot of commute time (in minutes) against departure hour.]

That function was called a hypothesis function, denoted $h(x_i)$. Remember, the output of $h$ is a predicted $y$-value.

$$\text{predicted commute time}_i = h(\text{departure hour}_i)$$

We looked at two types of hypothesis function:

  • The constant model, $h(x_i) = w$

  • The simple linear regression model, $h(x_i) = w_0 + w_1 x_i$

We’ll focus on the latter, which was first introduced in Chapter 1.4. To find optimal model parameters, $w_0^*$ (the best intercept) and $w_1^*$ (the best slope), we followed the three-step modeling recipe:

1. Choose a model.

$$h(x_i) = w_0 + w_1 x_i$$

2. Choose a loss function. Our default choice has been squared loss:

$$L_\text{sq}(y_i, h(x_i)) = (y_i - h(x_i))^2$$

3. Minimize average loss (also known as empirical risk) to find optimal parameters. Average squared loss – also known as mean squared error – for any hypothesis function $h$ takes the form:

$$\frac{1}{n} \sum_{i=1}^n (y_i - h(x_i))^2$$

For the simple linear regression model, this becomes:

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

In Chapter 1.4, we used calculus to minimize $R_\text{sq}(w_0, w_1)$ to find the optimal parameters, $w_0^*$ and $w_1^*$. This involved taking two partial derivatives, $\frac{\partial R_\text{sq}}{\partial w_0}$ and $\frac{\partial R_\text{sq}}{\partial w_1}$, setting them equal to zero, and solving the resulting system of equations. At the end of that process, we found

$$\underbrace{w_1^* = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = r \frac{\sigma_y}{\sigma_x}}_\text{optimal slope}, \quad \underbrace{w_0^* = \bar y - w_1^* \bar x}_\text{optimal intercept}$$

where $r$ is the correlation coefficient between $x$ and $y$, and $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$, respectively.
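Before moving on, here’s a quick numerical check (on a tiny made-up dataset, not the real commute data) that the two expressions for the optimal slope above really do agree:

import numpy as np

# Tiny hypothetical dataset, just for checking the algebra.
x = np.array([7.5, 8.0, 8.5, 9.0, 10.0])
y = np.array([80.0, 72.0, 75.0, 64.0, 55.0])

# Covariance-style formula for the optimal slope...
slope_cov = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# ...and the equivalent r * sigma_y / sigma_x form.
r = np.corrcoef(x, y)[0, 1]
slope_r = r * np.std(y) / np.std(x)

print(np.isclose(slope_cov, slope_r))  # True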

Those formulas, when applied to the dataset of commute times, describe the following line.

[Figure: scatter plot of the commute data with the best-fit line overlaid.]

Here’s the plan:

  • In Chapter 3.1 (that’s here), we’ll see another way to find $w_0^*$ and $w_1^*$ that doesn’t involve calculus, but rather leverages our new knowledge of vectors and matrices.

  • In Chapter 3.2, we’ll look at how to use this linear algebraic approach to add multiple input variables to our model. The orange line above only uses one input variable, departure time, but we might want to incorporate day of the week, weather, etc.

  • In Homeworks 8 and 9, we’ll give you a taste of how to approach the process of adding new features, and why adding lots of features doesn’t necessarily lead to better predictions on real-world, unseen data.


The Design Matrix

Big idea: How can we express mean squared error,

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} {\color{red} \sum_{i=1}^n} \underbrace{(y_i - (w_0 + w_1 x_i))^{\color{red}2}}_{({\text{actual} - \text{predicted}})^2}$$

in terms of vectors and matrices? If we can do so, then perhaps there will be another way to find $w_0^*$ and $w_1^*$ without needing to take partial derivatives. This will make our life a lot easier when we add more features to our model.

Remember that if $\vec x \in \mathbb{R}^n$, then $\lVert \vec x \rVert^2 = \sum_{i = 1}^n x_i^2$. In the formula for $R_\text{sq}$ above, I’ve colored ${\color{red}\sum_{i = 1}^n}$ and $\cdot^{\color{red}2}$ in red to try and make the case that $R_\text{sq}$ looks a lot like the squared norm of a vector that contains the errors of our predictions. Let’s try to define this vector.
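As a quick sanity check of that norm fact, on a made-up vector:

import numpy as np

x = np.array([3.0, 4.0, 12.0])
print(np.linalg.norm(x) ** 2)  # 169.0
print(np.sum(x ** 2))          # 169.0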

Consider a dataset of $n$ points, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, on which we’d like to fit a simple linear regression model $h(x_i) = w_0 + w_1 x_i$.

  • The observation vector, ${\color{orange} \vec y}$, is a vector in $\mathbb{R}^n$ with the $n$ $y$-values from the dataset.

    $${\color{orange} \vec y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \underbrace{\begin{bmatrix} \text{actual commute time}_1 \\ \text{actual commute time}_2 \\ \vdots \\ \text{actual commute time}_n \end{bmatrix}}_{\text{for the commute times example}}$$

    This vector contains our “right answers” that we are trying to predict as best as we can.

  • The prediction vector, ${\color{#004d40} \vec p}$, is a vector in $\mathbb{R}^n$ with the $n$ predicted values from the model.

    $${\color{#004d40} \vec p} = \begin{bmatrix} h(x_1) \\ h(x_2) \\ \vdots \\ h(x_n) \end{bmatrix} = \begin{bmatrix} w_0 + w_1 {\color{#3d81f6} x_1} \\ w_0 + w_1 {\color{#3d81f6} x_2} \\ \vdots \\ w_0 + w_1 {\color{#3d81f6} x_n} \end{bmatrix}$$

    We want ${\color{#004d40} \vec p}$ to be as close as possible to ${\color{orange} \vec y}$.

We can express the prediction vector, ${\color{#004d40} \vec p}$, as a matrix-vector product!

$${\color{#004d40} \vec p} = \begin{bmatrix} w_0 + w_1 {\color{#3d81f6} x_1} \\ w_0 + w_1 {\color{#3d81f6} x_2} \\ \vdots \\ w_0 + w_1 {\color{#3d81f6} x_n} \end{bmatrix} = w_0 \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} + w_1 \begin{bmatrix} \color{#3d81f6} x_1 \\ \color{#3d81f6} x_2 \\ \vdots \\ \color{#3d81f6} x_n \end{bmatrix} = \begin{bmatrix} 1 & \color{#3d81f6} x_1 \\ 1 & \color{#3d81f6} x_2 \\ \vdots & \vdots \\ 1 & \color{#3d81f6} x_n \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = {\color{#3d81f6} X} \vec w$$

In ${\color{#004d40} \vec p} = {\color{#3d81f6} X} \vec w$,

  • ${\color{#3d81f6} X}$, called the design matrix, is an $n \times 2$ matrix with its first column being all 1s and its second column being the inputs ${\color{#3d81f6} x_1, x_2, \ldots, x_n}$. It’s the most important among these definitions, hence why this section is titled “The Design Matrix”.

    $${\color{#3d81f6} X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & \text{departure time}_1 \\ 1 & \text{departure time}_2 \\ \vdots & \vdots \\ 1 & \text{departure time}_n \end{bmatrix}}_{\text{for the commute times example}}$$

    I haven’t been able to find a good reason for why this is called the design matrix; my best explanation is that it has to do with how we designed our model. The column of 1s is there for our intercept term, $w_0$.

  • The parameter vector, $\vec w$, is a $2 \times 1$ vector containing our model’s parameters, $w_0$ and $w_1$.

    $$\vec w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$$

    We’re trying to find the best choice of $\vec w$.
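To make these definitions concrete, here’s a minimal numpy sketch on a tiny made-up dataset (the departure hours and the parameter values are hypothetical, chosen just to show the shapes involved):

import numpy as np

# Hypothetical departure hours for n = 4 commutes.
x = np.array([7.5, 8.0, 9.0, 10.5])

# Design matrix X: a column of 1s (for the intercept) next to the x-values.
X = np.column_stack([np.ones(len(x)), x])

# Some (not necessarily optimal) parameter vector w = [w0, w1].
w = np.array([140.0, -8.0])

# Prediction vector p = Xw: one predicted commute time per data point.
p = X @ w
print(X.shape, p)  # (4, 2) [80. 76. 68. 56.]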

Remember, our goal is to convert

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

into an expression that involves vectors and matrices. Using our new definitions, we’re almost there! $R_\text{sq}$ involves a sum of squared errors. So, let’s define the error vector, ${\color{red} \vec e}$, as the difference between ${\color{orange} \vec y}$ and ${\color{#004d40} \vec p} = {\color{#3d81f6} X} \vec w$.

$${\color{red} \vec e} = {\color{orange} \vec y} - {\color{#004d40} \vec p} = {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w = {\color{orange} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}} - \begin{bmatrix} w_0 + w_1 {\color{#3d81f6} x_1} \\ w_0 + w_1 {\color{#3d81f6} x_2} \\ \vdots \\ w_0 + w_1 {\color{#3d81f6} x_n} \end{bmatrix} = \begin{bmatrix} {\color{orange} y_1} - (w_0 + w_1 {\color{#3d81f6} x_1}) \\ {\color{orange} y_2} - (w_0 + w_1 {\color{#3d81f6} x_2}) \\ \vdots \\ {\color{orange} y_n} - (w_0 + w_1 {\color{#3d81f6} x_n}) \end{bmatrix}$$

The components of ${\color{red} \vec e}$ are the quantities being summed and squared in $R_\text{sq}$. How do we get that sum and square? By taking the norm, as I alluded to earlier.

$$\lVert {\color{red} \vec e} \rVert^2 = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

$\lVert {\color{red} \vec e} \rVert^2$ is almost $R_\text{sq}$; it’s just missing a $\frac{1}{n}$ up front. So,

$$\boxed{R_\text{sq}(\vec w) = \frac{1}{n} \lVert {\color{red} \vec e} \rVert^2 = \frac{1}{n} \lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \rVert^2}$$

We’ve completed our “conversion” of $R_\text{sq}$ into a vector-based expression.
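In code, the two forms of $R_\text{sq}$ really are the same number. Here’s a short sketch, reusing the hypothetical $X$ and $\vec w$ from before, plus some made-up actual commute times:

import numpy as np

x = np.array([7.5, 8.0, 9.0, 10.5])
y = np.array([82.0, 73.0, 65.0, 58.0])   # hypothetical actual commute times
X = np.column_stack([np.ones(len(x)), x])
w = np.array([140.0, -8.0])

# Vector form: (1/n) * ||y - Xw||^2.
e = y - X @ w
R_vector = np.linalg.norm(e) ** 2 / len(y)

# Original form: average of the squared errors, point by point.
R_sum = np.mean((y - (w[0] + w[1] * x)) ** 2)

print(np.isclose(R_vector, R_sum))  # True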


The Normal Equations Return

Now, our goal is to find the vector $\vec w^*$ that minimizes

$$R_\text{sq}(\vec w) = \frac{1}{n} \lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \rVert^2$$

This sounds like a familiar problem. ${\color{#3d81f6} X} \vec w$ is a vector in $\text{colsp}({\color{#3d81f6} X})$, so it looks like we want to find the vector in $\text{colsp}({\color{#3d81f6} X})$ that is closest to ${\color{orange} \vec y}$. The extra $\frac{1}{n}$ up front doesn’t change the optimization problem; as we studied in our calculus review at the start of the semester (and in the examples in Chapter 0.2), the minimizer of $cf(w)$ is the same as the minimizer of $f(w)$ when $c > 0$.

This is exactly the problem we solved in Chapter 2.10!


In Chapter 2.10, we found that the vector in $\text{colsp}({\color{#3d81f6} X})$ that is closest to ${\color{orange} \vec y}$ is the orthogonal projection of ${\color{orange} \vec y}$ onto $\text{colsp}({\color{#3d81f6} X})$, which is the vector

$${\color{#004d40} \vec p} = {\color{#3d81f6} X} \vec w^*$$

where $\vec w^*$ is chosen to satisfy the normal equation,

$${\color{#3d81f6} X^TX} \vec w^* = {\color{#3d81f6} X^T} {\color{orange} \vec y}$$

Note that both projection and prediction start with “p”, making it easy to remember that they’re related, and the vector ${\color{#004d40} \vec p}$ can be interpreted as either. Predictions are projections onto the column space of the design matrix.

What makes the projection “orthogonal” is that the projection ${\color{#004d40} \vec p} = {\color{#3d81f6} X} \vec w^*$ has an error vector ${\color{red} \vec e} = {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w^*$ that is orthogonal to every vector in $\text{colsp}({\color{#3d81f6} X})$.

$${\color{#3d81f6} X^T} {\color{red} \vec e} = \vec 0$$

When ${\color{#3d81f6} X}$’s columns are linearly independent, ${\color{#3d81f6} X^TX}$ is invertible, and the unique solution $\vec w^*$ to the normal equation is

$$\vec w^* = ({\color{#3d81f6} X^TX})^{-1} {\color{#3d81f6} X^T} {\color{orange} \vec y}$$

The vector $\vec w^*$, in this context, is called our optimal parameter vector, since it contains our optimal choices of parameters, $w_0^*$ and $w_1^*$. Note that the columns of

$${\color{#3d81f6} X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$

are linearly independent as long as not all of the $x_i$’s are the same. If they are all the same, then ${\color{#3d81f6} X}$’s second column is a multiple of its first column, so ${\color{#3d81f6} X}$ is not of full column rank, and neither is ${\color{#3d81f6} X^TX}$ (since both matrices have the same rank). This corresponds to the case where the data points all lie on a single vertical line, so no line of the form $h(x_i) = w_0^* + w_1^* x_i$ can pass through them.
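Both facts are easy to check numerically. Here’s a sketch on the same made-up data as before: solve the normal equation, confirm that the error vector is orthogonal to the columns of ${\color{#3d81f6} X}$, and watch the rank drop when every $x_i$ is identical.

import numpy as np

x = np.array([7.5, 8.0, 9.0, 10.5])
y = np.array([82.0, 73.0, 65.0, 58.0])   # hypothetical data
X = np.column_stack([np.ones(len(x)), x])

# Solve the normal equation X^T X w* = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# The error vector of the projection is orthogonal to colsp(X): X^T e = 0.
e = y - X @ w_star
print(X.T @ e)  # approximately [0, 0], up to floating-point error

# If every x_i is the same, X loses full column rank, and so does X^T X.
X_bad = np.column_stack([np.ones(4), np.full(4, 8.0)])
print(np.linalg.matrix_rank(X_bad), np.linalg.matrix_rank(X_bad.T @ X_bad))  # 1 1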

So, to summarize,

$$R_\text{sq}(\vec w) = \frac{1}{n} \lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \rVert^2$$

is minimized when

$${\color{#3d81f6} X^TX} \vec w^* = {\color{#3d81f6} X^T} {\color{orange}\vec y} \implies \underbrace{\vec w^* = ({\color{#3d81f6} X^TX})^{-1} {\color{#3d81f6} X^T} {\color{orange} \vec y}}_{\text{if } {\color{#3d81f6} X} \text{'s columns are linearly independent}}$$

To be clear, when ${\color{#3d81f6} X}$ is the $n \times 2$ design matrix and ${\color{orange} \vec y}$ is a vector with our $n$ $y$-values, then $\vec w^*$ as defined above contains the exact same values as our calculus-based formulas for $w_0^*$ and $w_1^*$! You’ll be tasked with completing a full proof of this in Homework 7.


Implementing $\vec w^*$ in Code

While you may have to wait until you complete Homework 7 to see a proof that

$$\vec w^* = ({\color{#3d81f6} X^TX})^{-1} {\color{#3d81f6} X^T} {\color{orange} \vec y} = \begin{bmatrix} \bar{y} - r \frac{\sigma_y}{\sigma_x} \bar{x} \\ r \frac{\sigma_y}{\sigma_x} \end{bmatrix}$$

I think it’s worthwhile for me to give you a preview that these are equivalent, through the lens of code. The commutes DataFrame in pandas, stored below, contains our commute times data. The first column, departure_hour, is our $x$-variable, while minutes is our $y$-variable, the commute time we’re trying to predict.

commutes
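(The outputs in the rest of this section come from the real course dataset. If you’re following along without it, a hypothetical stand-in with the same two columns could be built as follows; your numbers will differ from the ones printed below.)

import numpy as np
import pandas as pd

# Hypothetical stand-in for the commutes DataFrame, only for experimentation;
# the real dataset has different values, so outputs won't match this section's.
rng = np.random.default_rng(42)
departure_hour = rng.uniform(7, 11, size=30)
minutes = 142 - 8 * departure_hour + rng.normal(0, 5, size=30)
commutes = pd.DataFrame({'departure_hour': departure_hour, 'minutes': minutes})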

Approach 1: The Old Formulas

We’ve done this before, but let’s implement our original formulas for $w_0^*$ and $w_1^*$ in code.

$$w_1^* = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = r \frac{\sigma_y}{\sigma_x}, \quad w_0^* = \bar y - w_1^* \bar x$$
import numpy as np

def optimal_slope(x, y):
    # Optimal slope: r * (standard deviation of y) / (standard deviation of x).
    r = np.corrcoef(x, y)[0, 1]
    return r * np.std(y) / np.std(x)

optimal_slope(commutes['departure_hour'], commutes['minutes'])
-8.186941724265557
def optimal_intercept(x, y):
    return np.mean(y) - optimal_slope(x, y) * np.mean(x)

optimal_intercept(commutes['departure_hour'], commutes['minutes'])
142.44824158772875

To actually use these optimal parameters to make predictions, we need to compute $w_0^* + w_1^* x_i$ ourselves.

def predicted_commute(departure_hour):
    w0_star = optimal_intercept(commutes['departure_hour'], commutes['minutes'])
    w1_star = optimal_slope(commutes['departure_hour'], commutes['minutes'])
    return w0_star + w1_star * departure_hour

predicted_commute(15)
19.644115723745387

Approach 2: The Normal Equations

Now, we’ll use the fact that

$$\vec w^* = ({\color{#3d81f6} X^TX})^{-1} {\color{#3d81f6} X^T} {\color{orange} \vec y}$$

to find the optimal slope and intercept. Remember, Chapter 2.9 told us that it’s generally a bad idea to use np.linalg.inv directly; instead, we should use np.linalg.solve to solve the normal equations.

# We need to make the n x 2 design matrix X, which has a column of 1's
# to account for the intercept.
commutes.loc[:, '1'] = 1
commutes[['1', 'departure_hour']]
X = commutes[['1', 'departure_hour']]
y = commutes['minutes']

w_star = np.linalg.solve(X.T @ X, X.T @ y)
w_star
array([142.44824159, -8.18694172])

w_star contains the same values as our calculus-based formulas for $w_0^*$ and $w_1^*$ from above!

To use w_star to make predictions, we eventually need to evaluate $w_0^* + w_1^* x_i$, but it turns out this can be expressed as a dot product between $\vec w^*$ and $\begin{bmatrix} 1 \\ x_i \end{bmatrix}$. More on this in Chapter 3.2.

# Same as w0_star + w1_star * 15
np.dot(w_star, np.array([1, 15]))
19.644115723744356
# Also the same
w_star @ np.array([1, 15])
19.644115723744356
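As an aside, numpy can also solve the least squares problem in one step with np.linalg.lstsq, which avoids forming $X^TX$ explicitly. This isn’t the approach emphasized in this chapter, just an equivalent alternative worth knowing about:

# Equivalent alternative: minimize ||y - Xw||^2 directly with lstsq.
w_star_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
w_star_lstsq  # same values as w_star above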

Approach 3: sklearn

Finally, we can use sklearn’s LinearRegression class to find $w_0^*$ and $w_1^*$.

from sklearn.linear_model import LinearRegression

model = LinearRegression()

# By default, sklearn knows to minimize mean squared error,
# and knows to add an intercept term to the model.
# If you want to turn off the intercept, you can set `fit_intercept=False`.
model.fit(commutes[['departure_hour']].to_numpy(), commutes['minutes'].to_numpy())

print(model.intercept_, model.coef_)
142.4482415877287 [-8.18694172]

Once again, we get the same values as the prior two approaches! None of this should be a surprise, but it’s reassuring to see that (1) our calculus and linear algebra approaches are consistent, and (2) both of those are equivalent to using sklearn. All three approaches involve the same three-step modeling recipe.

And finally, to use model to make predictions, we don’t need to do any of the math ourselves: we can use the predict method.

# The same, once again!
# Notice that the input is a 2D array.
# More on this in Chapter 3.2.
model.predict([[15]])
array([19.64411572])

The Three Pictures

Throughout Chapter 1.4 and here in Chapter 3.1, we’ve seen three different diagrams involving the simple linear regression model, and it’s important to understand what each one depicts.

Picture 1: The Data and Model ($\mathbb{R}^2$)

The first and most intuitive diagram is the one that shows the original points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ in $\mathbb{R}^2$ with the fit line $h(x_i) = w_0^* + w_1^* x_i$.

[Figure: the commute data in $\mathbb{R}^2$ with the fit line overlaid.]

Picture 2: The Projection View ($\mathbb{R}^n$)

Then, there is the diagram that shows that the optimal predictions, ${\color{#004d40} \vec p} = {\color{#3d81f6} X} \vec w^*$, are the orthogonal projections of ${\color{orange} \vec y}$ onto $\text{colsp}({\color{#3d81f6} X})$. Unlike the above plot, which is in $\mathbb{R}^2$, this one is in $\mathbb{R}^n$, where $n$ is our number of data points.


Picture 3: The Loss Surface ($\mathbb{R}^3$)

The final relevant picture is that of the graph of mean squared error, i.e. the graph whose $x$-axis is $w_0$, $y$-axis is $w_1$, and whose $z$-axis is $R_\text{sq}(\vec w)$. The $w_0$ and $w_1$ coordinates of the “bottom” of the graph correspond to the optimal parameters in $\vec w^*$, which are the weights in the linear combination of the columns of ${\color{#3d81f6} X}$ that is closest to ${\color{orange} \vec y}$.

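If you’d like to recreate a picture like this yourself, here’s a rough matplotlib sketch of the loss surface, drawn as a contour plot and assuming the commutes DataFrame and w_star from earlier are still in scope:

import numpy as np
import matplotlib.pyplot as plt

x = commutes['departure_hour'].to_numpy()
y = commutes['minutes'].to_numpy()

# Grid of candidate (w0, w1) values around the optimum.
W0, W1 = np.meshgrid(np.linspace(80, 200, 100), np.linspace(-20, 5, 100))

# Mean squared error at every grid point, via broadcasting over the data.
R = np.mean((y[None, None, :] - (W0[..., None] + W1[..., None] * x)) ** 2, axis=-1)

plt.contour(W0, W1, R, levels=30)
plt.scatter(*w_star, color='red')  # the minimizer found earlier
plt.xlabel('$w_0$ (intercept)')
plt.ylabel('$w_1$ (slope)')
plt.title('Mean squared error')
plt.show()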

You should take time to understand how these are all related.

  • Our goal is to find the best fitting line in Picture 1.

  • To do that, we minimize mean squared error in Picture 3.

  • To do that, we project ${\color{orange} \vec y}$ onto $\text{colsp}({\color{#3d81f6} X})$ in Picture 2.

With this in mind, let’s move to Chapter 3.2, where we’ll see how to extend our new linear algebraic approach to multiple input variables, in what’s called multiple linear regression.