2.10. Projections, Revisited

We’re almost ready to return to our original motivation for studying linear algebra, which was to perform linear regression using multiple input variables. This section outlines the final piece of the puzzle.


Approximating using a Single Vector

In Chapter 2.3, we introduced the approximation problem, which asked:

Among all vectors of the form k \, {\color{#3d81f6} \vec x}, which one is closest to {\color{orange}\vec y}?

We now know the answer is the vector p\color{#004d40} \vec p, where

p=(yxxx)x{\color{#004d40} \vec p} = \left( \frac{{\color{orange}\vec y} \cdot \color{#3d81f6} \vec x}{{\color{#3d81f6}\vec x} \cdot {\color{#3d81f6}\vec x}} \right) \color{#3d81f6} \vec x

p\color{#004d40} \vec p is called the orthogonal projection of y\color{orange} \vec y onto x\color{#3d81f6} \vec x.

Note that I’ve used y\color{orange} \vec y and x\color{#3d81f6} \vec x here rather than u\color{orange} \vec u and v\color{#3d81f6} \vec v, just to make the notation more consistent with the notation we’ll use as we move back into the world of machine learning.


As we’ve studied, the resulting error vector,

e=yp{\color{#d81a60} \vec e} = {\color{orange} \vec y} - {\color{#004d40} \vec p}

is orthogonal to x\color{#3d81f6} \vec x.
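
If you'd like to verify this numerically, here's a quick NumPy sketch (the particular x and y below are made up for illustration): it computes the projection using the formula above and checks that the error vector is orthogonal to x.

import numpy as np

x = np.array([2.0, 1.0])    # the vector we're projecting onto (made up)
y = np.array([1.0, 3.0])    # the vector we're approximating (made up)

p = (y @ x) / (x @ x) * x   # orthogonal projection of y onto x
e = y - p                   # error vector

print(p)        # [2. 1.]
print(e @ x)    # 0.0 (up to floating point error): e is orthogonal to x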

In our original look at the approximation problem, we were approximating y\color{orange} \vec y using a scalar multiple of just a single vector, x\color{#3d81f6} \vec x. The set of all scalar multiples of x\color{#3d81f6} \vec x, denoted by span({x})\text{span}(\{ {\color{#3d81f6} \vec x}\}), is a line in Rn\mathbb{R}^n.

Key idea: instead of projecting onto the subspace spanned by just a single vector, how might we project onto the subspace spanned by multiple vectors?


Approximating using Multiple Vectors

Equipped with our understanding of linear independence, spans, subspaces, and column spaces, we're ready to tackle a more advanced version of the approximation problem. It can be phrased in three equivalent ways:

  1. Among all linear combinations of {\color{#3d81f6} \vec x^{(1)}}, {\color{#3d81f6} \vec x^{(2)}}, ..., {\color{#3d81f6} \vec x^{(d)}}, which one is closest to {\color{orange} \vec y}?

  2. Among all vectors in \text{span}(\{ {\color{#3d81f6} \vec x^{(1)}}, {\color{#3d81f6} \vec x^{(2)}}, \ldots, {\color{#3d81f6} \vec x^{(d)}} \}) = \text{colsp}({\color{#3d81f6} X}), which one is closest to {\color{orange} \vec y}?

  3. Among all vectors of the form {\color{#3d81f6} X} \vec w, where \vec w \in \mathbb{R}^d, which one is closest to {\color{orange} \vec y}?

All three statements are asking the exact same question; I've presented all three forms so that you see more clearly how the ideas of spans, column spaces, and matrix-vector multiplication fit together. I will tend to refer to the latter two versions of the problem the most. In what follows, suppose {\color{#3d81f6} X} is an n \times d matrix whose columns {\color{#3d81f6} \vec x^{(1)}}, {\color{#3d81f6} \vec x^{(2)}}, ..., {\color{#3d81f6} \vec x^{(d)}} are the building blocks we want to approximate {\color{orange} \vec y} with.

First, let’s get the trivial case out of the way. If ycolsp(X){\color{orange} \vec y} \in \text{colsp}({\color{#3d81f6} X}), then the vector in colsp(X)\text{colsp}({\color{#3d81f6} X}) that is closest to y\color{orange} \vec y is just y\color{orange} \vec y itself. If that’s the case, there exists some w\vec w such that y=Xw{\color{orange} \vec y} = {\color{#3d81f6} X} \vec w exactly. This w\vec w is unique if and only if X{\color{#3d81f6} X}'s columns are linearly independent; otherwise, there will be infinitely many such w\vec w’s.

But, that’s not the case I’m really interested in. I care more about when y\color{orange} \vec y is not in colsp(X)\text{colsp}({\color{#3d81f6} X}). (Remember, this is the case we’re interested in when we’re doing linear regression: usually, it’s not possible to make our predictions 100% correct, and we’ll have to settle for some error.)

Then what?

In general, colsp(X)\text{colsp}({\color{#3d81f6} X}) is an rr-dimensional subspace of Rn\mathbb{R}^n, where r=rank(X)r = \text{rank}({\color{#3d81f6} X}). In the diagram below, I’ve used a plane to represent colsp(X)\text{colsp}({\color{#3d81f6} X}); just remember that X{\color{#3d81f6} X} may have more than 3 rows or columns.

[Figure: y sitting outside of colsp(X), which is drawn as a plane.]

Remember that colsp(X)\text{colsp}({\color{#3d81f6} X}) is the set of linear combinations of X{\color{#3d81f6} X}'s columns, so it’s the set of all vectors that can be written as Xw{\color{#3d81f6} X} \vec w, where wRd\vec w \in \mathbb{R}^d.

Let’s consider two possible vectors of the form Xw{\color{#3d81f6} X} \vec w, and look at their corresponding error vectors, e=yXw{\color{#d81a60} \vec e} = {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w. I won’t draw the columns of X{\color{#3d81f6} X}, since those would clutter up the picture.

[Figure: two candidate vectors in colsp(X), X w_o and X w', along with their error vectors e_o (orthogonal to the plane) and e' (not orthogonal to the plane).]

Our problem boils down to finding the w\vec w that minimizes the norm of the error vector. Since it’s a bit easier to work with squared norms (remember that x2=xx\lVert \vec x \rVert^2 = \vec x \cdot \vec x), we’ll minimize the squared norm of the error vector instead; this is an equivalent problem, since the norm is non-negative to begin with.

e2=yXw2which w minimizes this?\underbrace{\lVert {\color{#d81a60} \vec e} \rVert^2 = \lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \rVert^2}_{\text{which $\vec w$ minimizes this?}}

Think of yXw2\lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \rVert^2 as a function of w\vec w only; X\color{#3d81f6} X and y\color{orange} \vec y should be thought of as fixed. This is a least squares problem: we’re looking for the w\vec w that minimizes the sum of squared errors between y\color{orange} \vec y and Xw{\color{#3d81f6} X} \vec w.
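
Here's what this objective looks like in code, using the same (made-up) X and y that appear in a worked example later in this section. Different choices of w give different squared errors; our job is to find the w that makes this number as small as possible.

import numpy as np

X = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [1.0, 1.0],
              [0.0, -1.0]])
y = np.array([1.0, 0.0, 4.0, 5.0])

def squared_error(w):
    # The quantity we want to minimize: ||y - Xw||^2.
    e = y - X @ w
    return e @ e

print(squared_error(np.array([1.0, 1.0])))    # 49.0
print(squared_error(np.array([2.0, -2.0])))   # 30.0: smaller, but is it the smallest possible?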

There are two ways we’ll minimize this function of w\vec w:

  1. Using a geometric argument, as we did in the single vector case.

  2. Using calculus. This is more involved than before, since the input variable is a vector, not a scalar, but it can be done, as we’ll see in Chapter 4.

Let’s focus on the geometric argument. What does our intuition tell us? Extending the single vector case, we expect the vector in colsp(X)\text{colsp}({\color{#3d81f6} X}) that is closest to y\color{orange} \vec y to be the orthogonal projection of y\color{orange} \vec y onto colsp(X)\text{colsp}({\color{#3d81f6} X}): that is, its error should be orthogonal to colsp(X)\text{colsp}({\color{#3d81f6} X}).

We could see this intuitively in the visual above. wo\vec w_\text{o} was chosen to make eo\color{#d81a60}\vec e_\text{o} orthogonal to colsp(X)\text{colsp}({\color{#3d81f6} X}), meaning that eo\color{#d81a60}\vec e_\text{o} is orthogonal to every vector in colsp(X)\text{colsp}({\color{#3d81f6} X}). (The subscript “o” stands for “orthogonal”.) w\vec w' was some other arbitrary vector, leading e\color{#d81a60}\vec e' to not be orthogonal to colsp(X)\text{colsp}({\color{#3d81f6} X}). Clearly eo\color{#d81a60}\vec e_\text{o} is shorter than e\color{#d81a60}\vec e'.

To prove that the optimal choice of w\vec w comes from making the error vector orthogonal to colsp(X)\text{colsp}({\color{#3d81f6} X}), you could use the same argument as in the single vector case: if you consider two vectors, wo\vec w_\text{o} with an orthogonal error vector eo\color{#d81a60}\vec e_\text{o}, and w\vec w' with an error vector e\color{#d81a60}\vec e' that is not orthogonal to colsp(X)\text{colsp}({\color{#3d81f6} X}), then we can draw a right triangle with e\color{#d81a60}\vec e' as the hypotenuse and eo\color{#d81a60}\vec e_\text{o} as one of the legs, making

e2>eo2\lVert {\color{#d81a60}\vec e'} \rVert^2 > \lVert {\color{#d81a60}\vec e_\text{o}} \rVert^2

This is such an important idea that I want to redraw the picture above with just the orthogonal projection. Note that I’ve replaced wo\vec w_\text{o} with w\vec w^* to indicate that this is the optimal choice of w\vec w.

[Figure: the orthogonal projection X w* of y onto colsp(X), with its error vector e orthogonal to the plane.]

A Proof that the Orthogonal Error Vector is the Shortest

In office hours, a student asked for more justification that the shortest possible error vector is one that’s orthogonal to the column space of X\color{#3d81f6} X, especially because it’s hard to visualize what orthogonality looks like in higher dimensions. Remember, all it means for two vectors to be orthogonal is that their dot product is 0.

Given that, let’s assume only that

  • wo\vec w_\text{o} is chosen so that eo=yXwo{\color{#d81a60} \vec e_\text{o}} = {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w_\text{o} is orthogonal to colsp(X)\text{colsp}({\color{#3d81f6} X}), and

  • w\vec w' is any other choice of w\vec w, with a corresponding error vector e=yXw{\color{#d81a60} \vec e'} = {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w'.

Just with these facts alone, we can show that eo\color{#d81a60} \vec e_\text{o} is the shortest possible error vector. To do so, let’s start by considering the (squared) magnitude of e\color{#d81a60} \vec e':

\begin{align*} \lVert {\color{#d81a60} \vec e'} \rVert^2 &= \lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w' \rVert^2 \\ &= \underbrace{\lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w' + {\color{#3d81f6} X} \vec w_\text{o} - {\color{#3d81f6} X} \vec w_\text{o} \rVert^2}_{\text{seems arbitrary, but it's a legal operation that brings } \vec w_\text{o} \text{ back in}} \\ &= \lVert \underbrace{{\color{orange} \vec y} - {\color{#3d81f6} X} \vec w_\text{o}}_{``a"} + \underbrace{{\color{#3d81f6} X} (\vec w_\text{o} - \vec w')}_{``b"} \rVert^2 \\ &= \underbrace{\lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w_\text{o} \rVert^2 + \lVert {\color{#3d81f6} X} (\vec w_\text{o} - \vec w') \rVert^2 + 2 ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w_\text{o}) \cdot {\color{#3d81f6} X} (\vec w_\text{o} - \vec w')}_{\lVert \vec a + \vec b \rVert^2 = \lVert \vec a \rVert^2 + \lVert \vec b \rVert^2 + 2 \vec a \cdot \vec b} \\ &= \lVert {\color{#d81a60} \vec e_\text{o}} \rVert^2 + \lVert {\color{#3d81f6} X} (\vec w_\text{o} - \vec w') \rVert^2 + \underbrace{2 {\color{#d81a60} \vec e_\text{o}} \cdot {\color{#3d81f6} X} (\vec w_\text{o} - \vec w')}_{\text{0, because } {\color{#d81a60} \vec e_\text{o}} \text{ is orthogonal to the columns of } \color{#3d81f6} X} \\ &= \lVert {\color{#d81a60} \vec e_\text{o}} \rVert^2 + \lVert {\color{#3d81f6} X} (\vec w_\text{o} - \vec w') \rVert^2 \\ &\geq \lVert {\color{#d81a60} \vec e_\text{o}} \rVert^2 \end{align*}

So, no matter what choice of w\vec w' we make, the magnitude of e\color{#d81a60} \vec e' can’t be smaller than the magnitude of eo\color{#d81a60} \vec e_\text{o}. This means that the error vector that is orthogonal to the column space of X\color{#3d81f6} X is the shortest possible error vector.

This is really just the same proof as in Chapter 2.3, where we argued that eo{\color{#d81a60} \vec e_\text{o}}, X(wow){\color{#3d81f6} X}(\vec w_\text{o} - \vec w'), and e\color{#d81a60} \vec e' form a right triangle, where e\color{#d81a60} \vec e' is the hypotenuse.
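
Here's a quick numerical version of the same argument, reusing the X and y from the earlier sketch. The vector w_o below is computed by solving the normal equation, which we derive in the next subsection; the point here is just that no other choice of w ever produces a shorter error vector.

import numpy as np

X = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [1.0, 1.0],
              [0.0, -1.0]])
y = np.array([1.0, 0.0, 4.0, 5.0])

# w_o makes the error vector orthogonal to colsp(X); it solves X^T X w = X^T y
# (the normal equation, derived next).
w_o = np.linalg.solve(X.T @ X, X.T @ y)
e_o = y - X @ w_o

rng = np.random.default_rng(23)
for _ in range(1000):
    w_prime = w_o + rng.normal(size=2)     # some other choice of w
    e_prime = y - X @ w_prime
    assert e_prime @ e_prime >= e_o @ e_o  # never shorter than the orthogonal error

print(e_o @ e_o)   # roughly 29.67, the smallest possible squared error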


The Normal Equation

We’ve come to the conclusion that in order to find the w\vec w that minimizes

e2=yXw2\lVert {\color{#d81a60} \vec e} \rVert^2 = \lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \rVert^2

we need to find the w\vec w that makes the error vector e=yXw{\color{#d81a60} \vec e} = {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w orthogonal to colsp(X)\text{colsp}({\color{#3d81f6} X}). colsp(X)\text{colsp}({\color{#3d81f6} X}) is the set of all linear combinations of X{\color{#3d81f6} X}'s columns. So, if we can find an e\color{#d81a60} \vec e that is orthogonal to every column of X{\color{#3d81f6} X}, then it must be orthogonal to any of their linear combinations, too.

So, we're looking for an {\color{#d81a60} \vec e} = {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w that satisfies

x(1)(yXw)=0x(2)(yXw)=0x(d)(yXw)=0\begin{align*} {\color{#3d81f6} \vec x^{(1)}} \cdot ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) &= 0 \\ {\color{#3d81f6} \vec x^{(2)}} \cdot ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) &= 0 \\ &\vdots \\ {\color{#3d81f6} \vec x^{(d)}} \cdot ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) &= 0 \end{align*}

As you might have guessed, there’s an easier way to write these dd equations simultaneously. Above, we’re taking the dot product of yXw{\color{orange} \vec y} - {\color{#3d81f6} X} \vec w with each of the columns of X\color{#3d81f6} X. We’ve learned that AvA \vec v contains the dot products of v\vec v with the rows of AA. So how do we get the dot products of yXw{\color{orange} \vec y} - {\color{#3d81f6} X} \vec w with the columns of X\color{#3d81f6} X? Transpose it!

XT(yXw)=[x(1)Tx(2)Tx(d)T](yXw)=[x(1)(yXw)x(2)(yXw)x(d)(yXw)]{\color{#3d81f6} X^T}({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) = {\color{#3d81f6}\begin{bmatrix} {\color{#3d81f6} —} & {\color{#3d81f6} \vec{x}^{(1)}}^T & {\color{#3d81f6} —} \\ {\color{#3d81f6} —} & {\color{#3d81f6} \vec{x}^{(2)}}^T & {\color{#3d81f6} —} \\ & \vdots & \\ {\color{#3d81f6} —} & {\color{#3d81f6} \vec{x}^{(d)}}^T & {\color{#3d81f6} —} \end{bmatrix}} ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) = \begin{bmatrix} {\color{#3d81f6}\vec x^{(1)}} \cdot ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) \\ {\color{#3d81f6}\vec x^{(2)}} \cdot ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) \\ \vdots \\ {\color{#3d81f6}\vec x^{(d)}} \cdot ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) \end{bmatrix}

So, if we want yXw{\color{orange} \vec y} - {\color{#3d81f6} X} \vec w to be orthogonal to each of the columns of X\color{#3d81f6} X, then we need XT(yXw)=0{\color{#3d81f6} X^T}({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) = \vec 0 (note that this is the vector 0Rd\vec 0 \in \mathbb{R}^d, not the scalar 0.) Another way of saying this is that we need the error vector to be in the left null space of X\color{#3d81f6} X, i.e. enull(XT){\color{#d81a60} \vec e} \in \text{null}({\color{#3d81f6} X^T}).

\begin{align*} {\color{#3d81f6}X^T}{\color{#d81a60} \vec e} &= \vec 0 \\ {\color{#3d81f6} X^T}({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) &= \vec 0 \\ {\color{#3d81f6} X^T} {\color{orange} \vec y} - {\color{#3d81f6} X^T} {\color{#3d81f6} X} \vec w &= \vec 0 \\ {\color{#3d81f6} X^T} {\color{#3d81f6} X} \vec w &= {\color{#3d81f6} X^T} {\color{orange} \vec y} \end{align*}

The final equation above is called the normal equation. “Normal” means “orthogonal”. Sometimes it’s called the normal equations to reference the fact that it’s a system of dd equations and dd unknowns, where the unknowns are components of w\vec w (w1,w2,,wdw_1, w_2, \ldots, w_d.)
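
In code, "solving the normal equation" and "making the error vector orthogonal to every column of X" are the same thing. Here's a small sketch with randomly generated data (the dimensions and the random seed are arbitrary) that checks this.

import numpy as np

rng = np.random.default_rng(42)
n, d = 6, 3
X = rng.normal(size=(n, d))   # a random n x d matrix; its columns are (almost surely) linearly independent
y = rng.normal(size=n)

# Solve the normal equation X^T X w = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

e = y - X @ w_star
print(X.T @ e)   # all entries are (numerically) zero: e is orthogonal to every column of X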

Note that XTXw=XTy{\color{#3d81f6} X^T} {\color{#3d81f6} X} \vec w = {\color{#3d81f6} X^T} {\color{orange} \vec y} looks a lot like Xw=y{\color{#3d81f6} X} \vec w = \color{orange} \vec y, with an extra factor of XT\color{#3d81f6} X^T multiplied on the left of both sides. Remember that if y\color{orange} \vec y is in colsp(X)\text{colsp}({\color{#3d81f6} X}), then Xw=y{\color{#3d81f6} X} \vec w = \color{orange} \vec y has a solution, but that’s usually not the case, which is why we’re attempting to approximate y\color{orange} \vec y with a linear combination of X\color{#3d81f6} X’s columns.

Is there a unique vector w\vec w that satisfies the normal equation? That depends on whether XTX\color{#3d81f6} X^TX is invertible. XTX\color{#3d81f6} X^TX is a d×dd \times d matrix with the same rank as the n×dn \times d matrix X\color{#3d81f6} X, as we proved in Chapter 2.8.

rank(XTX)=rank(X)\text{rank}({\color{#3d81f6} X^TX}) = \text{rank}({\color{#3d81f6} X})

So, XTX\color{#3d81f6} X^TX is invertible if and only if rank(X)=d\text{rank}({\color{#3d81f6} X}) = d, meaning all of X\color{#3d81f6} X’s columns are linearly independent. In that case, the best choice of w\vec w is the unique vector

w=(XTX)1XTy\boxed{\vec w^* = ({\color{#3d81f6} X^TX})^{-1} \color{#3d81f6}X^T \color{orange} \vec y}

w\vec w^* has a star on it, denoting that it is the best choice of w\vec w. I don’t ask you to memorize much (you get to bring a notes sheet into your exams, after all), but this equation is perhaps the most important of the semester! It might even look familiar: back in the single vector case in Chapter 2.3, the optimal coefficient on x\vec x was xyxx=xTyxTx\frac{\vec x \cdot \vec y}{\vec x \cdot \vec x} = \frac{\vec x^T \vec y}{\vec x^T \vec x}, which looks similar to the one above. The difference is that here, XTX\color{#3d81f6} X^T X is a matrix, not a scalar. (But, if X\color{#3d81f6} X is just a matrix with a single column, then XTX\color{#3d81f6} X^T X is just the dot product of X\color{#3d81f6} X with itself, which is a scalar, and the boxed formula above reduces to the formula from Chapter 2.3.)
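
To see the connection to Chapter 2.3 concretely, here's a sketch in which X has just one column (the numbers are made up): the boxed formula and the single-vector projection coefficient give the same answer.

import numpy as np

x = np.array([2.0, 1.0, 2.0])   # a single column (made up)
y = np.array([1.0, 3.0, 0.0])   # made up

# Single-vector formula from Chapter 2.3.
k = (x @ y) / (x @ x)

# Boxed formula, treating x as a 3x1 matrix X.
X = x.reshape(-1, 1)
w_star = np.linalg.inv(X.T @ X) @ X.T @ y

print(k)        # 0.5555...
print(w_star)   # [0.5555...]: the same number, just wrapped in a length-1 vector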

What if XTX\color{#3d81f6} X^TX isn’t invertible? Then, there are infinitely many w\vec w^*'s that satisfy the normal equation,

XTXw=XTy{\color{#3d81f6} X^T} {\color{#3d81f6} X} \vec w = {\color{#3d81f6} X^T} {\color{orange} \vec y}

It’s not immediately obvious what it means for there to be infinitely many solutions to the normal equation; I’ve dedicated a whole subsection to it below to give this idea the consideration it deserves.

In the examples that follow, we’ll look at how to find all of the solutions to this equation when there are infinitely many.

First, let’s start with a straightforward example. Let

X=[10211101],y=[1045]{\color{#3d81f6} X = \begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 1 & 1 \\ 0 & -1 \end{bmatrix}}, \quad {\color{orange} \vec y = \begin{bmatrix} 1 \\ 0 \\ 4 \\ 5 \end{bmatrix}}

The vector in colsp(X)\text{colsp}({\color{#3d81f6} X}) that is closest to y\color{orange} \vec y is the vector Xw{\color{#3d81f6} X} \vec w^*, where w\vec w^* is the solution to the normal equations,

XTXw=XTy{\color{#3d81f6} X^T X} \vec w^* = {\color{#3d81f6} X^T} \color{orange} \vec y

The first step is to compute XTX{\color{#3d81f6} X^T X}, which is a 2×22 \times 2 matrix of dot products of the columns of X{\color{#3d81f6} X}.

XTX=[6333]{\color{#3d81f6} X^T X = \begin{bmatrix} 6 & 3 \\ 3 & 3 \end{bmatrix}}

XTX{\color{#3d81f6} X^T X} is invertible, so we can solve for w\vec w^* uniquely. Remember that in practice, we’d ask Python to run np.linalg.solve(X.T @ X, X.T @ y), but here XTX\color{#3d81f6} X^T X is small enough that we can invert it by hand.

XTX=[6333]    (XTX)1=19[3336]=[1/31/31/32/3]{\color{#3d81f6} X^T X = \begin{bmatrix} 6 & 3 \\ 3 & 3 \end{bmatrix}} \implies ({\color{#3d81f6} X^T X})^{-1} = \frac{1}{9} \begin{bmatrix} 3 & -3 \\ -3 & 6 \end{bmatrix} = \begin{bmatrix} 1/3 & -1/3 \\ -1/3 & 2/3 \end{bmatrix}

Then,

\vec w^* = ({\color{#3d81f6} X^T X})^{-1} {\color{#3d81f6} X^T} {\color{orange} \vec y} = \underbrace{\begin{bmatrix} 1/3 & -1/3 \\ -1/3 & 2/3 \end{bmatrix}}_{({\color{#3d81f6} X^T X})^{-1}} \underbrace{{\color{#3d81f6}\begin{bmatrix} 1 & 2 & 1 & 0 \\ 0 & 1 & 1 & -1 \end{bmatrix}}}_{\color{#3d81f6}X^T} {\color{orange}\begin{bmatrix} 1 \\ 0 \\ 4 \\ 5 \end{bmatrix}} = \begin{bmatrix} 2 \\ -7/3 \end{bmatrix}

The magic is in the interpretation of the numbers in \vec w^*, 2 and -\frac{7}{3}. These are the coefficients of the columns of {\color{#3d81f6} X} in the linear combination that is closest to {\color{orange} \vec y}. Meaning,

Xw=2[1210]x(1)73[0111]x(2)=[25/31/37/3]{\color{#3d81f6} X} \vec w^* = 2 \underbrace{\color{#3d81f6}\begin{bmatrix} 1 \\ 2 \\ 1 \\ 0 \end{bmatrix}}_{\color{#3d81f6}\vec x^{(1)}} - \frac{7}{3} \underbrace{\color{#3d81f6}\begin{bmatrix} 0 \\ 1 \\ 1 \\ -1 \end{bmatrix}}_{\color{#3d81f6}\vec x^{(2)}} = \begin{bmatrix} 2 \\ 5/3 \\ -1/3 \\ 7/3 \end{bmatrix}

is the vector in colsp(X)\text{colsp}({\color{#3d81f6} X}) that is closest to y\color{orange} \vec y. This vector is the orthogonal projection of y\color{orange} \vec y onto colsp(X)\text{colsp}({\color{#3d81f6} X}).
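
Here's the same computation in NumPy. As mentioned earlier, in practice we'd use np.linalg.solve on the normal equation rather than explicitly inverting X^T X; both give the same w* here.

import numpy as np

X = np.array([[1, 0],
              [2, 1],
              [1, 1],
              [0, -1]])
y = np.array([1, 0, 4, 5])

w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)       # [ 2.  -2.3333...], i.e. [2, -7/3]
print(X @ w_star)   # [ 2.  1.6667  -0.3333  2.3333], i.e. [2, 5/3, -1/3, 7/3]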


Examples

The worked example above is the most concrete one. The examples that follow will build our understanding of how orthogonal projections really work.

Example: Point and Plane

Find the point on the plane 6x3y+2z=06x - 3y + 2z = 0 that is closest to the point (1,1,1)(1, 1, 1).

Example: What if ycolsp(X){\color{orange} \vec y} \in \text{colsp}({\color{#3d81f6} X})?

Find the orthogonal projection of y=[1321]\color{orange} \vec y = \begin{bmatrix} 1 \\ 3 \\ 2 \\ -1 \end{bmatrix} onto colsp(X)\text{colsp}({\color{#3d81f6} X}), where X=[10211101]{\color{#3d81f6} X = \begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 1 & 1 \\ 0 & -1 \end{bmatrix}}.

Example: Orthogonality with the Columns of XX

Let X=[12320212]\color{#3d81f6} X = \begin{bmatrix} 1 & 2 \\ 3 & 2 \\ 0 & 2 \\ 1 & 2 \end{bmatrix}, Z=[10010130]\color{#004d40} Z = \begin{bmatrix} 1 & 0 \\ 0 & -1 \\ 0 & 1 \\ 3 & 0 \end{bmatrix}, and y\color{orange} \vec y be any vector in R4\mathbb{R}^4.

Let pX\vec p_X and pZ\vec p_Z be the orthogonal projections of y\color{orange} \vec y onto colsp(X)\text{colsp}({\color{#3d81f6} X}) and colsp(Z)\text{colsp}({\color{#004d40} Z}), respectively.

Explain why it is guaranteed that the components of the vector

eX=ypX{\color{#d81a60} \vec e_X} = {\color{orange} \vec y} - \vec p_X

sum to zero, but the components of the vector eZ=ypZ{\color{#d81a60} \vec e_Z} = {\color{orange} \vec y} - \vec p_Z do not necessarily.

Example: XX with Orthogonal Columns

Let X=[3/134/54/133/512/130]\color{#3d81f6} X = \begin{bmatrix} 3/13 & -4/5 \\ 4/13 & 3/5 \\ 12/13 & 0 \end{bmatrix} and y=[101]\color{orange} \vec y = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}.

Find the value of w\vec w^* that minimizes yXw2\lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \rVert^2. What about X\color{#3d81f6} X makes this easier than in other examples?

Example: Why Can’t We Separate?

Why can’t we separate

w=(XTX)1XTy\vec w^* = ({\color{#3d81f6} X^T X})^{-1} {\color{#3d81f6} X^T} \color{orange} \vec y

into

w=X1(XT)1XTIy=X1y\vec w^* = {\color{#3d81f6} X^{-1}} \underbrace{({\color{#3d81f6} X^T})^{-1} {\color{#3d81f6} X^T}}_{I} {\color{orange} \vec y} = {\color{#3d81f6} X^{-1}} \color{orange} \vec y

in general?


What if XX’s Columns are Linearly Dependent?

In the case where X\color{#3d81f6} X’s columns are linearly dependent, we can’t invert XTX\color{#3d81f6} X^T \color{#3d81f6} X to solve for w\vec w^*. This means that

XTXw=XTy{\color{#3d81f6} X^T} {\color{#3d81f6} X} \vec w = {\color{#3d81f6} X^T} {\color{orange} \vec y}

has infinitely many solutions. Let’s give more thought to what these solutions actually are.

First, note that all of these solutions for \vec w^* correspond to the same projection, \vec p = {\color{#3d81f6} X} \vec w^*. The “best approximation” of {\color{orange} \vec y} in \text{colsp}({\color{#3d81f6}X}) is always just one vector; if there are infinitely many \vec w^*'s, that just means there are infinitely many ways of describing that one best approximation. This is because when vectors are linearly dependent, the same linear combination can be written using multiple different sets of coefficients; when they are linearly independent, each vector in their span has exactly one representation.

Let me drive this point home further. Let’s suppose both w1\vec w_1 and w2\vec w_2 satisfy

XTXw=XTy{\color{#3d81f6} X^T} {\color{#3d81f6} X} \vec w = {\color{#3d81f6} X^T} {\color{orange} \vec y}

Then,

X^TX \vec w_1 - X^TX \vec w_2 = X^T \vec y - X^T \vec y = \vec 0

which means that

(XTX)(w1w2)=0(X^TX)(\vec w_1 - \vec w_2) = \vec 0

i.e. the difference between the two vectors, w1w2\vec w_1 - \vec w_2, is in nullsp(XTX)\text{nullsp}(X^TX). But, back in Chapter 2.8, we proved that XTXX^TX and XX have the same null space, meaning any vector that gets sent to 0\vec 0 by XX also gets sent to 0\vec 0 by XTXX^TX, and vice versa.

So,

X(w1w2)=0X(\vec w_1 - \vec w_2) = \vec 0

too, but that just means

Xw1=Xw2X \vec w_1 = X \vec w_2

meaning that even though w1\vec w_1 and w2\vec w_2 are different-looking coefficient vectors, they both still correspond to the same linear combination of XX’s columns!
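
The key fact used above is that X and X^T X have the same null space (from Chapter 2.8). Here's a quick numerical illustration, using a small made-up matrix whose third column is the sum of its first two.

import numpy as np

# Made-up matrix with linearly dependent columns: col3 = col1 + col2.
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0],
              [2.0, 0.0, 2.0]])

v = np.array([1.0, 1.0, -1.0])   # col1 + col2 - col3 = 0, so v is in nullsp(X)

print(X @ v)         # [0. 0. 0.]
print(X.T @ X @ v)   # [0. 0. 0.]: v is in nullsp(X^T X) too
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(X.T @ X))   # 2 2: same rank, hence same null space dimension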

Let’s see how we can apply this to an example. Let X=[310621310]{\color{#3d81f6} X = \begin{bmatrix} 3 & 1 & 0 \\ 6 & 2 & 1 \\ 3 & 1 & 0 \end{bmatrix}} and y=[213]{\color{orange} \vec y = \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix}}. This is an example of a matrix with linearly dependent columns, so there’s no unique w\vec w^* that satisfies the normal equations.

Finding One Solution

One way to find a possible vector w\vec w^* is to solve the normal equations. XTX\color{#3d81f6} X^T \color{#3d81f6} X is not invertible, so we can’t solve for w\vec w^* uniquely, but we can try and find a parameterized solution.

I’m going to take a different approach here: instead, let’s just toss out the linearly dependent columns of X\color{#3d81f6} X and solve for w\vec w^* using the remaining columns. Then, w\vec w^* for the full X\color{#3d81f6} X can use the same coefficients for the linearly independent columns, but 0s for the dependent ones. Removing the linearly dependent columns does not change colsp(X)\text{colsp}({\color{#3d81f6} X}) (i.e. the set of all linear combinations of X\color{#3d81f6} X’s columns), so the projection is the same.

The easy choice is to keep columns 2 and 3, since their entries are the smallest. (Notice that column 1 is just 3 times column 2, so dropping it doesn't change \text{colsp}({\color{#3d81f6} X}).) So, for now, let's say

X=[102110],y=[213]X' = \begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 1 & 0 \end{bmatrix}, \quad \color{orange} \vec y = \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix}

Here, w=(XTX)1XTy=[5/24]\vec w' = (X'^T X')^{-1} X'^T {\color{orange} \vec y} = \begin{bmatrix} 5/2 \\ -4 \end{bmatrix}. I won’t bore you with the calculations; you can verify them yourself.

Now, one possible w\vec w^* for the full X\color{#3d81f6} X is [05/24]\begin{bmatrix} 0 \\ 5/2 \\ -4\end{bmatrix}, which keeps the same coefficients on columns 2 and 3 as in w\vec w', but 0 for the column we didn’t use.
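
If you'd like to check those numbers, here's a short sketch. It also confirms that padding w' with a 0 really does satisfy the normal equations for the full X.

import numpy as np

X = np.array([[3.0, 1.0, 0.0],
              [6.0, 2.0, 1.0],
              [3.0, 1.0, 0.0]])
y = np.array([2.0, 1.0, 3.0])

X_prime = X[:, [1, 2]]   # keep only columns 2 and 3 (which are linearly independent)
w_prime = np.linalg.solve(X_prime.T @ X_prime, X_prime.T @ y)
print(w_prime)           # [ 2.5 -4. ]

w_star = np.array([0.0, w_prime[0], w_prime[1]])
print(np.allclose(X.T @ X @ w_star, X.T @ y))   # True: w_star satisfies the normal equations
print(X @ w_star)                               # [2.5 1.  2.5], the projection of y onto colsp(X)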

Finding All Solutions

As I mentioned above, if there are infinitely many solutions to the normal equation, then the difference between any two solutions is in nullsp(XTX)\text{nullsp}({\color{#3d81f6} X^TX}), which is also nullsp(X)\text{nullsp}({\color{#3d81f6} X}). Put another way, if ws\vec w_s satisfies the normal equations, then so does ws+n\vec w_s + \vec n for any nnullsp(XTX)\vec n \in \text{nullsp}({\color{#3d81f6} X^TX}).

\begin{align*} {\color{#3d81f6} X^T X} \vec w_s &= {\color{#3d81f6} X^T} {\color{orange} \vec y} \\ {\color{#3d81f6} X^T X} (\vec w_s + \vec n) &= {\color{#3d81f6} X^T X} \vec w_s + \underbrace{{\color{#3d81f6} X^T X} \vec n}_{\vec 0, \text{ by definition of null space}} = {\color{#3d81f6} X^T} {\color{orange} \vec y} + \vec 0 = {\color{#3d81f6} X^T} {\color{orange} \vec y} \end{align*}

So, once we have one w\vec w^*, to get the rest, just add any vector in nullsp(XTX)\text{nullsp}({\color{#3d81f6} X^TX}) or nullsp(X)\text{nullsp}({\color{#3d81f6} X}) (since those are the same spaces).

What is nullsp(X)\text{nullsp}({\color{#3d81f6} X})? It’s the set of vectors v\vec v such that Xv=0{\color{#3d81f6} X} \vec v = \vec 0.

In our particular example,

X=[310621310]{\color{#3d81f6} X = \begin{bmatrix} 3 & 1 & 0 \\ 6 & 2 & 1 \\ 3 & 1 & 0 \end{bmatrix}}

we see that rank(X)=2\text{rank}({\color{#3d81f6} X}) = 2, so nullsp(X)\text{nullsp}({\color{#3d81f6} X}) has a dimension of 32=13 - 2 = 1 (by the rank-nullity theorem), so it’s going to be the span of a single vector. All we need to do now is find one vector in nullsp(X)\text{nullsp}({\color{#3d81f6} X}), and we will know that the null space is the set of scalar multiples of that vector.

X[130]=[310621310][130]=[000]{\color{#3d81f6} X} \begin{bmatrix} 1 \\ -3 \\ 0 \end{bmatrix} = {\color{#3d81f6}\begin{bmatrix} 3 & 1 & 0 \\ 6 & 2 & 1 \\ 3 & 1 & 0 \end{bmatrix}}\begin{bmatrix} 1 \\ -3 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}

So, since nullsp(X)=nullsp(XTX)=span({[130]})\text{nullsp}({\color{#3d81f6} X}) = \text{nullsp}({\color{#3d81f6} X^TX}) = \text{span}\left(\left\{\begin{bmatrix} 1 \\ -3 \\ 0 \end{bmatrix}\right\}\right), we know that the set of all possible w\vec w^*'s is

[05/24]+t[130],tRthere are infinitely many solutions to the normal equations, but they’re all of this form\underbrace{\begin{bmatrix} 0 \\ 5/2 \\ -4 \end{bmatrix} + t\begin{bmatrix} 1 \\ -3 \\ 0 \end{bmatrix}, t \in \mathbb{R}}_\text{there are infinitely many solutions to the normal equations, but they're all of this form}

This is not a subspace, since it doesn’t contain the zero vector.
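
As a numerical check, here's a sketch showing that every w* of this form produces the exact same projection Xw*. It also shows what np.linalg.lstsq does here: when X has linearly dependent columns, it returns one particular solution of the least squares problem (the minimum-norm one), which is generally not the one we found by zeroing out a column, but it corresponds to the same projection.

import numpy as np

X = np.array([[3.0, 1.0, 0.0],
              [6.0, 2.0, 1.0],
              [3.0, 1.0, 0.0]])
y = np.array([2.0, 1.0, 3.0])

w0 = np.array([0.0, 2.5, -4.0])      # the particular solution found above
n_vec = np.array([1.0, -3.0, 0.0])   # a basis vector for nullsp(X)

for t in [-2.0, 0.0, 1.0, 10.0]:
    print(X @ (w0 + t * n_vec))      # always [2.5 1.  2.5]

w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_lstsq)       # a different-looking solution...
print(X @ w_lstsq)   # ...but the same projection, [2.5 1.  2.5]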


There’s another way to arrive at this set of possible w\vec w^*'s: we can solve the normal equations directly. I wouldn’t recommend this second approach since it’s much longer, but I’ll add it here for completeness.

XTX=[363121010][310621310]=[541861862621]XTy=[363121010][213]=[2171]\begin{align*} {\color{#3d81f6}X^TX}&={\color{#3d81f6}\begin{bmatrix} 3 & 6 & 3 \\ 1 & 2 & 1 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 3 & 1 & 0 \\ 6 & 2 & 1 \\ 3 & 1 & 0 \end{bmatrix}}={\color{#3d81f6}\begin{bmatrix}54 & 18 & 6 \\ 18 & 6 & 2 \\ 6 & 2 & 1 \end{bmatrix}} \\ \\ {\color{#3d81f6}X^T}{\color{orange}\vec y}&={\color{#3d81f6}\begin{bmatrix} 3 & 6 & 3 \\ 1 & 2 & 1 \\ 0 & 1 & 0 \end{bmatrix}}{\color{orange}\begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix}}=\begin{bmatrix} 21 \\ 7 \\ 1 \end{bmatrix} \end{align*}

Then, the normal equations XTXw=XTy{\color{#3d81f6}X^TX}\vec w^*={\color{#3d81f6}X^T}\color{orange}\vec y give us

54w1+18w2+6w3=2118w1+6w2+2w3=76w1+2w2+w3=1\begin{align*} 54w^*_1 + 18w^*_2 + 6w^*_3 &= 21 \\18w^*_1 + 6w^*_2 + 2w^*_3&=7 \\6w^*_1 + 2w^*_2 + w^*_3 &= 1 \end{align*}

The first and second equations are just scalar multiples of each other, so we can disregard one of them, and solve for a form where we can use one unknown as a parameter for the other two. To illustrate, let’s pick w1=tw_1^* = t.

18t+6w2+2w3=76t+2w2+w3=1\begin{align*} 18t + 6w^*_2 + 2w^*_3&=7 \tag{2} \\6t + 2w^*_2 + w^*_3 &= 1 \tag{3} \end{align*}

(2)3(3)(2) - 3 \cdot (3) gives us w3=4w_3^* = -4. Plugging this into both equations gives us

\begin{align*} 18t + 6w_2^* - 8 &= 7 \implies 18t + 6w_2^* = 15 \\ 6t + 2w_2^* - 4 &= 1 \implies 6t + 2w_2^* = 5 \end{align*}

These are now both the same equation; the first one is just 3 times the second. So, we can solve for w2w_2^* in terms of tt:

w2=56t2w_2^* = \frac{5-6t}{2}

which gives us the complete solution

w=[t56t24],tR\vec w^* = \begin{bmatrix} t \\ \frac{5-6t}{2} \\ -4 \end{bmatrix}, t \in \mathbb{R}

This is the exact same line as using the null space approach! Plug in t=0t = 0 to get [05/24]\begin{bmatrix} 0 \\ 5/2 \\ -4 \end{bmatrix}, for example. Again, this is not a subspace, since it doesn’t contain the zero vector.


The Projection Matrix

So far, we’ve established that the vector in colsp(X)\text{colsp}({\color{#3d81f6} X}) that is closest to y\color{orange} \vec y is the vector Xw{\color{#3d81f6} X} \vec w^*, where w\vec w^* is the solution to the normal equations,

XTXw=XTy{\color{#3d81f6} X^T X} \vec w^* = {\color{#3d81f6} X^T} \color{orange} \vec y

If XTXX^TX is invertible, then w\vec w^* is the unique vector

w=(XTX)1XTy\vec w^* = ({\color{#3d81f6} X^T X})^{-1} \color{#3d81f6} X^T \color{orange} \vec y

meaning that the vector in colsp(X)\text{colsp}({\color{#3d81f6} X}) that is closest to y\color{orange} \vec y is

p=Xw=X(XTX)1XTy\vec p = {\color{#3d81f6} X} \vec w^* = {\color{#3d81f6} X} ({\color{#3d81f6} X^T X})^{-1} \color{#3d81f6} X^T \color{orange} \vec y

You’ll notice that the above expression also looks like a linear transformation applied to y\color{orange} \vec y, where y\color{orange} \vec y is being multiplied by the matrix

P=X(XTX)1XTP = {\color{#3d81f6} X} ({\color{#3d81f6} X^T X})^{-1} \color{#3d81f6} X^T

The matrix PP is called the projection matrix. In other classes, it is called the “hat matrix”, because they might use w^\hat{\mathbf{w}} instead of w\vec w^* and y^\hat{\mathbf{y}} instead of p\vec p, and in that notation, y^=Py\hat{\mathbf{y}} = P \mathbf{y}, so PP puts a “hat” on y\mathbf{y}. (I don’t use hat notation in this class because drawing a hat on top of a vector is awkward. Doesn’t w^\hat{\vec w} look strange?)

So,

p=Xw=Py\vec p = {\color{#3d81f6} X} \vec w^* = P \color{orange} \vec y

shows us that there are two ways to interpret the act of projecting y\color{orange} \vec y onto colsp(X)\text{colsp}({\color{#3d81f6} X}):

  1. The resulting vector is some optimal linear combination of X{\color{#3d81f6} X}'s columns.

  2. The resulting vector is the result of applying the linear transformation PP to y\color{orange} \vec y.

Let’s work out an example. Suppose

X=[30015460],y=[123]{\color{#3d81f6} X = \begin{bmatrix} {\color{#3d81f6} 3} & {\color{#3d81f6} 0} \\ {\color{#3d81f6} 0} & {\color{#3d81f6} 154} \\ {\color{#3d81f6} 6} & {\color{#3d81f6} 0} \end{bmatrix}}, \quad {\color{orange} \vec y = \begin{bmatrix} {\color{orange} 1} \\ {\color{orange} 2} \\ {\color{orange} 3} \end{bmatrix}}

X\color{#3d81f6} X’s columns are linearly independent, so XTX\color{#3d81f6} X^T X is invertible, and

P=X(XTX)1XTP = {\color{#3d81f6} X} ({\color{#3d81f6} X^T X})^{-1} \color{#3d81f6} X^T

is well-defined.

X = np.array([[3, 0], 
              [0, 154], 
              [6, 0]])

P = X @ np.linalg.inv(X.T @ X) @ X.T
P
array([[0.2, 0. , 0.4], [0. , 1. , 0. ], [0.4, 0. , 0.8]])
P @ np.array([1, 2, 3])
array([1.4, 2. , 2.8])

P=[0.200.40100.400.8]P = \begin{bmatrix} 0.2 & 0 & 0.4 \\ 0 & 1 & 0 \\ 0.4 & 0 & 0.8 \end{bmatrix} contains the information we need to project y\color{orange} \vec y onto colsp(X)\text{colsp}({\color{#3d81f6} X}). Each row of PP tells us the right mixture of y\color{orange} \vec y’s components we need to construct the projection.

Notice that P's second row is \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}. This came from the fact that {\color{#3d81f6} X}'s first column has a second component of 0, while its second column has a non-zero second component and zeros in its other two components, meaning we can scale {\color{#3d81f6} X}'s second column to exactly match {\color{orange} \vec y}'s second component. Change the 154 in {\color{#3d81f6} X} to any other non-zero value and P won't change!
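
You can check that last claim directly; here's a quick sketch comparing the projection matrix for the original X with the projection matrix we get after replacing the 154 with some other non-zero value.

import numpy as np

X1 = np.array([[3, 0], [0, 154], [6, 0]])
X2 = np.array([[3, 0], [0,   7], [6, 0]])   # 154 replaced with another non-zero value

P1 = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
P2 = X2 @ np.linalg.inv(X2.T @ X2) @ X2.T
print(np.allclose(P1, P2))   # True: the projection matrix doesn't change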

Additionally, if we consider some y\color{orange} \vec y that is already in colsp(X)\text{colsp}({\color{#3d81f6} X}), then multiplying it by PP doesn’t change it! For example, if we set y=[31546]\color{orange} \vec y = \begin{bmatrix} 3 \\ 154 \\ 6 \end{bmatrix} (the sum of X\color{#3d81f6} X’s columns), then Py=yP {\color{orange} \vec y} = {\color{orange} \vec y}.

P @ np.array([3, 154, 6])
array([ 3., 154., 6.])

Let’s work through some examples that develop our intuition for PP.

Example: Is PP invertible?

Suppose P=X(XTX)1XTP = {\color{#3d81f6} X} ({\color{#3d81f6} X^T X})^{-1} \color{#3d81f6} X^T exists, meaning XTX\color{#3d81f6} X^T X is invertible. Is PP invertible? If so, what is its inverse?

Example: Is PP orthogonal?

Is PP orthogonal?

Example: Is PP symmetric?

Is PP symmetric?

Example: Is PP idempotent?

Recall, an idempotent matrix AA satisfies A2=AA^2 = A. Is PP idempotent?

Example: What is PXPX, and why?

What is PXPX? What does the result mean?

Example: Rotations, Reflections, and Projections

Suppose AA is an arbitrary n×dn \times d matrix. Describe the conditions on AA that make the corresponding linear transformation f(x)=Axf(\vec x) = A \vec x a...

  1. Rotation

  2. Reflection

  3. Projection


Summary

Let’s take a step back and walk through our logic from this section once more, since it’s that important.

Suppose X{\color{#3d81f6} X} is an n×dn \times d matrix and y\color{orange} \vec y is some vector in Rn\mathbb{R}^n.

Orthogonal Projections

  1. Our goal is to find the linear combination of X\color{#3d81f6} X’s columns that is closest to y\color{orange} \vec y.

  2. This boils down to finding the vector w\vec w that minimizes yXw2\lVert {\color{orange} \vec y} - {\color{#3d81f6}X } \vec w \rVert^2.

  3. The vector w\vec w^* that minimizes yXw2\lVert {\color{orange} \vec y} - {\color{#3d81f6}X } \vec w \rVert^2 makes the resulting error vector, e=yXw{\color{#d81a60} \vec e} = {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w^*, orthogonal to the columns of X\color{#3d81f6} X.

  4. The w\vec w^* that makes the error vector orthogonal to the columns of X\color{#3d81f6} X is the one that satisfies the normal equation,

    XTXw=XTy{\color{#3d81f6} X^T} {\color{#3d81f6} X} \vec w^* = {\color{#3d81f6} X^T} {\color{orange} \vec y}
  5. If XTX{\color{#3d81f6} X^T} {\color{#3d81f6} X} is invertible, which happens if and only if X\color{#3d81f6}X’s columns are linearly independent, then w\vec w^* is the unique vector

    w=(XTX)1XTy\vec w^* = ({\color{#3d81f6} X^T} {\color{#3d81f6} X})^{-1} {\color{#3d81f6} X^T} \color{orange} \vec y

    otherwise, there are infinitely many solutions to the normal equation. All of these infinitely many solutions correspond to the same projection, p=Xw\vec p = {\color{#3d81f6} X} \vec w^*. If w\vec w' is one solution (which can be found by removing the linearly dependent columns of X\color{#3d81f6} X), then all other solutions are of the form w+n\vec w' + \vec n, where n\vec n is any vector in nullsp(X)=nullsp(XTX)\text{nullsp}({\color{#3d81f6} X}) = \text{nullsp}({\color{#3d81f6} X^TX}).

The Projection Matrix

Assuming X\color{#3d81f6} X has linearly independent columns, the projection matrix is

P=X(XTX)1XTP = {\color{#3d81f6} X} ({\color{#3d81f6} X^T X})^{-1} \color{#3d81f6} X^T

PP is defined such that PyP \color{orange} \vec y is the vector in colsp(X)\text{colsp}({\color{#3d81f6} X}) that is closest to y\color{orange} \vec y. PP is symmetric and idempotent, but (in general) neither invertible nor orthogonal.

We’re now finally ready to head back to the land of machine learning.