In the case where $X$'s columns are linearly dependent, we can't invert $X^TX$ to solve for $w^*$. This means that

$$X^TXw = X^Ty$$
has infinitely many solutions. Let’s give more thought to what these solutions actually are.
First, note that all of these solutions for $w^*$ correspond to the same projection, $p = Xw^*$. The "best approximation" of $y$ in $\text{colsp}(X)$ is always just one vector; if there are infinitely many $w^*$'s, that just means there are infinitely many ways of describing that one best approximation. This is because when vectors are linearly dependent, a vector in their span can be written as a linear combination of them in more than one way; when they are linearly independent, that representation is unique.
Let me drive this point home further. Let's suppose both $w_1$ and $w_2$ satisfy

$$X^TXw = X^Ty$$
Then,
$$X^TXw_1 - X^TXw_2 = X^Ty - X^Ty = 0$$

which means that

$$(X^TX)(w_1 - w_2) = 0$$

i.e. the difference between the two vectors, $w_1 - w_2$, is in $\text{nullsp}(X^TX)$. But, back in Chapter 5.3, we proved that $X^TX$ and $X$ have the same null space, meaning any vector that gets sent to $0$ by $X$ also gets sent to $0$ by $X^TX$, and vice versa.
So,
$$X(w_1 - w_2) = 0$$

too, but that just means

$$Xw_1 = Xw_2$$

meaning that even though $w_1$ and $w_2$ are different-looking coefficient vectors, they both still correspond to the same linear combination of $X$'s columns!
Let's see how we can apply this to an example. Let $X = \begin{bmatrix} 3 & 1 & 0 \\ 6 & 2 & 1 \\ 3 & 1 & 0 \end{bmatrix}$ and $y = \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix}$. This is an example of a matrix with linearly dependent columns, so there's no unique $w^*$ that satisfies the normal equations.
One way to find a possible vector $w^*$ is to solve the normal equations. $X^TX$ is not invertible, so we can't solve for $w^*$ uniquely, but we can try and find a parameterized solution.
I’m going to take a different approach here: instead, let’s just toss out the linearly dependent columns of X and solve for w∗ using the remaining columns. Then, w∗ for the full X can use the same coefficients for the linearly independent columns, but 0s for the dependent ones. Removing the linearly dependent columns does not change colsp(X) (i.e. the set of all linear combinations of X’s columns), so the projection is the same.
Here, columns 1 and 2 are scalar multiples of each other, so we only need to keep one of them along with column 3. The easy choice is to keep columns 2 and 3, since their entries are smallest, which makes the arithmetic easier. So, for now, let's say
$$X' = \begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 1 & 0 \end{bmatrix}, \qquad y = \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix}$$

Here, $w' = (X'^TX')^{-1}X'^Ty = \begin{bmatrix} 5/2 \\ -4 \end{bmatrix}$. I won't bore you with the calculations; you can verify them yourself.
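(In case you do want to check them, here's one way the arithmetic works out; this is just the formula above, expanded.)

$$X'^TX' = \begin{bmatrix} 6 & 2 \\ 2 & 1 \end{bmatrix}, \qquad X'^Ty = \begin{bmatrix} 7 \\ 1 \end{bmatrix}, \qquad (X'^TX')^{-1} = \frac{1}{2}\begin{bmatrix} 1 & -2 \\ -2 & 6 \end{bmatrix}$$

$$w' = \frac{1}{2}\begin{bmatrix} 1 & -2 \\ -2 & 6 \end{bmatrix}\begin{bmatrix} 7 \\ 1 \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 5 \\ -8 \end{bmatrix} = \begin{bmatrix} 5/2 \\ -4 \end{bmatrix}$$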
Now, one possible $w^*$ for the full $X$ is $\begin{bmatrix} 0 \\ 5/2 \\ -4 \end{bmatrix}$, which keeps the same coefficients on columns 2 and 3 as in $w'$, but 0 for the column we didn't use.
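As a quick numerical sanity check (a sketch using numpy; the variable names are my own), we can confirm both $w'$ and the padded $w^*$:

import numpy as np

X = np.array([[3, 1, 0],
              [6, 2, 1],
              [3, 1, 0]])
y = np.array([2, 1, 3])

# keep only the linearly independent columns (columns 2 and 3 of X)
X_prime = X[:, 1:]
w_prime = np.linalg.inv(X_prime.T @ X_prime) @ X_prime.T @ y
print(w_prime)                                   # [ 2.5 -4. ]

# pad with a 0 for the dropped column and check the normal equations for the full X
w_star = np.array([0, 2.5, -4])
print(np.allclose(X.T @ X @ w_star, X.T @ y))    # True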
As I mentioned above, if there are infinitely many solutions to the normal equation, then the difference between any two solutions is in $\text{nullsp}(X^TX)$, which is also $\text{nullsp}(X)$. Put another way, if $w_s$ satisfies the normal equations, then so does $w_s + n$ for any $n \in \text{nullsp}(X^TX)$:

$$X^TXw_s = X^Ty$$

$$X^TX(w_s + n) = X^TXw_s + \underbrace{X^TXn}_{0\text{, by definition of null space}} = X^Ty + 0 = X^Ty$$
So, once we have one w∗, to get the rest, just add any vector in nullsp(XTX) or nullsp(X) (since those are the same spaces).
What is nullsp(X)? It’s the set of vectors v such that Xv=0.
In our particular example,
$$X = \begin{bmatrix} 3 & 1 & 0 \\ 6 & 2 & 1 \\ 3 & 1 & 0 \end{bmatrix}$$
we see that rank(X)=2, so nullsp(X) has a dimension of 3−2=1 (by the rank-nullity theorem), so it’s going to be the span of a single vector. All we need to do now is find one vector in nullsp(X), and we will know that the null space is the set of scalar multiples of that vector.
Here, column 1 of $X$ is 3 times column 2, so $1 \cdot (\text{column 1}) - 3 \cdot (\text{column 2}) + 0 \cdot (\text{column 3}) = 0$, meaning $\begin{bmatrix} 1 \\ -3 \\ 0 \end{bmatrix}$ is in the null space. So, since $\text{nullsp}(X) = \text{nullsp}(X^TX) = \text{span}\left(\left\{\begin{bmatrix} 1 \\ -3 \\ 0 \end{bmatrix}\right\}\right)$, we know that the set of all possible $w^*$'s is

$$\begin{bmatrix} 0 \\ 5/2 \\ -4 \end{bmatrix} + t\begin{bmatrix} 1 \\ -3 \\ 0 \end{bmatrix}, \qquad t \in \mathbb{R}$$

(there are infinitely many solutions to the normal equations, but they're all of this form).
This is not a subspace, since it doesn’t contain the zero vector.
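Here's a small numerical sketch (variable names are my own) confirming that every vector of this form satisfies the normal equations and produces the exact same projection:

import numpy as np

X = np.array([[3, 1, 0],
              [6, 2, 1],
              [3, 1, 0]])
y = np.array([2, 1, 3])
w_base = np.array([0, 2.5, -4])    # the particular solution we found above
n = np.array([1, -3, 0])           # spans nullsp(X)

for t in [-2, 0, 1.5, 10]:
    w = w_base + t * n
    assert np.allclose(X.T @ X @ w, X.T @ y)   # w satisfies the normal equations
    assert np.allclose(X @ w, X @ w_base)      # and gives the same projection Xw
print("all checks passed")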
There’s another way to arrive at this set of possible w∗'s: we can solve the normal equations directly. I wouldn’t recommend this second approach since it’s much longer, but I’ll add it here for completeness.
Written out for our $X$ and $y$, the normal equations $X^TXw^* = X^Ty$ are

$$54w_1^* + 18w_2^* + 6w_3^* = 21 \qquad (1)$$

$$18w_1^* + 6w_2^* + 2w_3^* = 7 \qquad (2)$$

$$6w_1^* + 2w_2^* + w_3^* = 1 \qquad (3)$$

The first and second equations are just scalar multiples of each other, so we can disregard one of them, and solve for a form where we can use one unknown as a parameter for the other two. To illustrate, let's pick $w_1^* = t$, which turns equations (2) and (3) into

$$18t + 6w_2^* + 2w_3^* = 7 \qquad (2)$$

$$6t + 2w_2^* + w_3^* = 1 \qquad (3)$$

Computing $(2) - 3 \cdot (3)$ eliminates both $t$ and $w_2^*$, leaving $-w_3^* = 4$, i.e. $w_3^* = -4$. Plugging this into both equations gives us

$$18t + 6w_2^* = 15$$

$$6t + 2w_2^* = 5$$

These are now both the same equation; the first one is just 3 times the second. So, we can solve for $w_2^*$ in terms of $t$:

$$w_2^* = \frac{5 - 6t}{2}$$
which gives us the complete solution
$$w^* = \begin{bmatrix} t \\ \frac{5 - 6t}{2} \\ -4 \end{bmatrix}, \qquad t \in \mathbb{R}$$

This is the exact same line as using the null space approach, since $\frac{5 - 6t}{2} = \frac{5}{2} - 3t$! Plug in $t = 0$ to get $\begin{bmatrix} 0 \\ 5/2 \\ -4 \end{bmatrix}$, for example. Again, this is not a subspace, since it doesn't contain the zero vector.
That takes care of the linearly dependent case and its complete set of solutions.
So far, we've established that the vector in $\text{colsp}(X)$ that is closest to $y$ is the vector $Xw^*$, where $w^*$ is the solution to the normal equations,

$$X^TXw^* = X^Ty$$

If $X^TX$ is invertible, then $w^*$ is the unique vector

$$w^* = (X^TX)^{-1}X^Ty$$

meaning that the vector in $\text{colsp}(X)$ that is closest to $y$ is

$$p = Xw^* = X(X^TX)^{-1}X^Ty$$
You'll notice that the above expression also looks like a linear transformation applied to $y$, where $y$ is being multiplied by the matrix

$$P = X(X^TX)^{-1}X^T$$

The matrix $P$ is called the projection matrix. In other classes, it is called the "hat matrix", because they might use $\hat{w}$ instead of $w^*$ and $\hat{y}$ instead of $p$, and in that notation, $\hat{y} = Py$, so $P$ puts a "hat" on $y$. (I don't use hat notation in this class because drawing a hat on top of a vector is awkward. Doesn't $\hat{w}$ look strange?)
So,
$$p = Xw^* = Py$$
shows us that there are two ways to interpret the act of projecting y onto colsp(X):
The resulting vector is some optimal linear combination of X's columns.
The resulting vector is the result of applying the linear transformation P to y.
Let’s work out an example. Suppose
$$X = \begin{bmatrix} 3 & 0 \\ 0 & 154 \\ 6 & 0 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$$

$X$'s columns are linearly independent, so $X^TX$ is invertible, and

$$P = X(X^TX)^{-1}X^T$$
is well-defined.
import numpy as np

# the design matrix from the example above
X = np.array([[3, 0],
              [0, 154],
              [6, 0]])
# projection matrix P = X (X^T X)^{-1} X^T
P = X @ np.linalg.inv(X.T @ X) @ X.T
P
$P = \begin{bmatrix} 0.2 & 0 & 0.4 \\ 0 & 1 & 0 \\ 0.4 & 0 & 0.8 \end{bmatrix}$ contains the information we need to project $y$ onto $\text{colsp}(X)$. Each row of $P$ tells us the right mixture of $y$'s components we need to construct the projection.
Notice that $P$'s second row is $\begin{bmatrix} 0 & 1 & 0 \end{bmatrix}$. This came from the fact that $X$'s first column had a second component of 0 while its second column had a non-zero second component but zeros in the other two components, meaning that we can scale $X$'s second column to exactly match $y$'s second component. Change the 154 in $X$ to any other non-zero value and $P$ won't change!
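To see that claim in action, here's a quick check, continuing with the P computed above (X_alt and P_alt are just my names for the modified matrix and its projection matrix):

X_alt = np.array([[3, 0],
                  [0, 7],
                  [6, 0]])
P_alt = X_alt @ np.linalg.inv(X_alt.T @ X_alt) @ X_alt.T
print(np.allclose(P, P_alt))    # True: P doesn't depend on the value of the 154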
Additionally, if we consider some $y$ that is already in $\text{colsp}(X)$, then multiplying it by $P$ doesn't change it! For example, if we set $y = \begin{bmatrix} 3 \\ 154 \\ 6 \end{bmatrix}$ (the sum of $X$'s columns), then $Py = y$.
# a vector that's already in colsp(X), here the sum of X's columns, is unchanged by P
P @ np.array([3, 154, 6])
array([ 3., 154., 6.])
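And for a $y$ that isn't already in $\text{colsp}(X)$, like the $y$ from our example, $Py$ gives the projection, and the error $y - Py$ is orthogonal to $X$'s columns. Here's a quick check, continuing with the X and P defined above (a sketch, not part of the original example):

y = np.array([1, 2, 3])
p = P @ y                              # the projection of y onto colsp(X)
print(p)                               # approximately [1.4, 2, 2.8]
print(np.allclose(X.T @ (y - p), 0))   # True: the error is orthogonal to X's columns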
Let’s work through some examples that develop our intuition for P.
Suppose $P = X(X^TX)^{-1}X^T$ exists, meaning $X^TX$ is invertible. Is $P$ invertible? If so, what is its inverse?
Solution
Before we do any calculations, intuitively the answer should be no. Once we’ve projected onto colsp(X), we’ve lost information, since we went from an arbitrary vector in Rn to a vector in a smaller subspace, so it shouldn’t be possible to reverse the projection. Put another way, two different vectors in Rn might have the same “shadow” onto colsp(X).
Even in the most recent example, $P = \begin{bmatrix} 0.2 & 0 & 0.4 \\ 0 & 1 & 0 \\ 0.4 & 0 & 0.8 \end{bmatrix}$ is not invertible, since column 3 is a multiple of column 1.
Let’s think about this a bit more from the perspective of ranks. It turns out that rank(P)=rank(X); I’ve provided a proof of this at the bottom of the solutions box, but you might want to attempt it on your own for practice.
Remember that X is an n×d matrix, meaning rank(X)≤min(n,d). X doesn’t need to have a rank of n for XTX to be invertible; it just needs to have a rank of d.
Since $P$ is an $n \times n$ matrix, in general it won't be the case that $\text{rank}(P) = n$. To illustrate, in the example above where $P = \begin{bmatrix} 0.2 & 0 & 0.4 \\ 0 & 1 & 0 \\ 0.4 & 0 & 0.8 \end{bmatrix}$, $X$ was a $3 \times 2$ matrix with rank 2.
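If you'd like to verify the ranks numerically, np.linalg.matrix_rank does the job (continuing with the X and P from the example above; just a sanity check):

print(np.linalg.matrix_rank(X))   # 2
print(np.linalg.matrix_rank(P))   # 2, so the 3x3 matrix P is rank-deficient and not invertible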
The only case in which $\text{rank}(P) = n$ is when $\text{rank}(X) = n$, which (since $X$'s columns must be linearly independent for $P$ to exist) only happens when $X$ is an invertible $n \times n$ square matrix. In such a case, $\text{colsp}(X) = \mathbb{R}^n$, and so we probably wouldn't set out to project onto $\text{colsp}(X)$ in the first place, since any vector in $\mathbb{R}^n$ is already a linear combination of $X$'s columns.
Extra: Proof that rank(P)=rank(X)
We can show that this is the case by showing that P and X both have the same column spaces. This proof also helps explain why the normal equation, XTXw=XTy, always has at least one solution for w, even when XTX isn’t invertible.
Show colsp(P)⊆colsp(X), i.e. show that any vector in the column space of P is also in the column space of X.
If v∈colsp(P), then v can be written as a linear combination of P’s columns. Say v=Pu for some u∈Rn. Then,
$$v = Pu = X\underbrace{\left((X^TX)^{-1}X^Tu\right)}_{\text{some vector}}$$

Here, we see that if $v = Pu$, then $v$ is also in $X$'s column space. So,

$$v \text{ is a lin. comb. of } P\text{'s columns} \implies v \text{ is a lin. comb. of } X\text{'s columns}$$
Show colsp(X)⊆colsp(P), i.e. show that any vector in the column space of X is also in the column space of P.
This direction is a bit more involved. Let’s start by considering some vector v∈colsp(X), meaning
v=Xu
for some u∈Rd. What happens if we multiply both sides by P?
$$\begin{aligned} v &= Xu \\ Pv &= PXu \\ Pv &= X(X^TX)^{-1}X^TXu \\ Pv &= X\underbrace{(X^TX)^{-1}X^TX}_{I}u \\ Pv &= \underbrace{Xu}_{\text{original definition of } v} = v \end{aligned}$$
So, if $v = Xu$, then $v = Pv$, meaning that $v$ can be written as $P$ applied to a vector (namely, $v$ itself), so $v$ is also in $P$'s column space whenever it's in $X$'s column space. Intuitively, $v = Pv$ means that if $v$ is already in the span of $X$'s columns, then projecting it onto $\text{colsp}(X)$ doesn't change it.
Now that we’ve shown that colsp(P)⊆colsp(X) and colsp(X)⊆colsp(P), we can conclude that colsp(P)=colsp(X). If two sets are subsets of each other, then they must be equal.
Is $P$ an orthogonal matrix?

No. Orthogonal matrices $Q$ have the property that $Q^TQ = QQ^T = I$, meaning that

$$Q^T = Q^{-1}$$

But, as we saw, $P$ is not invertible in general, so it can't satisfy this property. This tells us that $P$ does not perform a rotation; projections are not rotations. Rotations can be undone but projections can't.
Is $P$ symmetric?

Yes. Symmetric matrices $A$ have the property that $A^T = A$. We can show that $P$ satisfies this property; to do so, we'll need to use the fact that $(AB)^T = B^TA^T$.

$$\begin{aligned} P^T &= \left(X(X^TX)^{-1}X^T\right)^T \\ &= (X^T)^T\left((X^TX)^{-1}\right)^TX^T \\ &= X(X^TX)^{-1}X^T \\ &= P \end{aligned}$$

Going from the second to the third line, we used the facts that $(X^T)^T = X$ and that $X^TX$ is symmetric, and so is its inverse. Remember that $X^TX$ is a square matrix consisting of the dot products of the columns of $X$ with one another.
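As a quick numerical confirmation with the P from our running example:

print(np.allclose(P, P.T))    # True: P is symmetric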
What is $P^2$, i.e. $PP$? Expanding,

$$P^2 = X(X^TX)^{-1}\underbrace{X^TX(X^TX)^{-1}}_{I}X^T = X(X^TX)^{-1}X^T = P$$

Intuitively, this means that $P^2y$ is the same as $Py$, meaning that once we've projected $y$ onto $\text{colsp}(X)$, projecting its projection $p$ again onto $\text{colsp}(X)$ gives us back the same $p$, since $p$ is already in $\text{colsp}(X)$.
What is $PX$? Interpret $PX$ as a matrix made up of $Px^{(1)}, Px^{(2)}, \ldots, Px^{(d)}$ as its columns, where $x^{(i)}$ is the $i$-th column of $X$. $Px^{(i)}$ is the projection of $x^{(i)}$ onto $\text{colsp}(X)$, but since $x^{(i)}$ is already in $\text{colsp}(X)$, projecting it again onto $\text{colsp}(X)$ gives us back the same $x^{(i)}$. So, $PX$ should just be $X$ again.
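Both of these facts are easy to check numerically with the X and P from our running example:

print(np.allclose(P @ P, P))    # True: projecting twice is the same as projecting once
print(np.allclose(P @ X, X))    # True: P leaves X's columns unchanged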
Our goal is to find the linear combination of X’s columns that is closest to y.
This boils down to finding the vector $w$ that minimizes $\lVert y - Xw \rVert^2$.
The vector $w^*$ that minimizes $\lVert y - Xw \rVert^2$ makes the resulting error vector, $e = y - Xw^*$, orthogonal to the columns of $X$.
The $w^*$ that makes the error vector orthogonal to the columns of $X$ is the one that satisfies the normal equation,

$$X^TXw^* = X^Ty$$

If $X^TX$ is invertible, which happens if and only if $X$'s columns are linearly independent, then $w^*$ is the unique vector

$$w^* = (X^TX)^{-1}X^Ty$$

otherwise, there are infinitely many solutions to the normal equation. All of these infinitely many solutions correspond to the same projection, $p = Xw^*$. If $w'$ is one solution (which can be found by removing the linearly dependent columns of $X$), then all other solutions are of the form $w' + n$, where $n$ is any vector in $\text{nullsp}(X) = \text{nullsp}(X^TX)$.
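As a practical aside (this is my addition, not part of the derivation above): numpy's np.linalg.lstsq solves least squares problems even when $X$'s columns are linearly dependent, returning one particular solution out of the infinitely many, specifically the one with the smallest norm. A short sketch using the $X$ and $y$ from this section:

import numpy as np

X = np.array([[3, 1, 0],
              [6, 2, 1],
              [3, 1, 0]])
y = np.array([2, 1, 3])

# lstsq doesn't require X^T X to be invertible; for rank-deficient X it returns
# the minimum-norm least squares solution, which still satisfies the normal equations.
w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(rank)                                   # 2
print(np.allclose(X.T @ X @ w, X.T @ y))      # True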