
2.10. Projections, Revisited

We’re almost ready to return to our original motivation for studying linear algebra, which was to perform linear regression using multiple input variables. This section outlines the final piece of the puzzle.


Approximating using a Single Vector

In Chapter 2.3, we introduced the approximation problem, which asked:

Among all vectors of the form $k{\color{#3d81f6} \vec v}$, which one is closest to ${\color{orange}\vec u}$?

We now know the answer is the vector ${\color{#004d40} \vec p}$, where

$${\color{#004d40} \vec p} = \left( \frac{{\color{orange}\vec u} \cdot {\color{#3d81f6} \vec v}}{{\color{#3d81f6}\vec v} \cdot {\color{#3d81f6}\vec v}} \right) {\color{#3d81f6} \vec v}$$

${\color{#004d40} \vec p}$ is called the orthogonal projection of ${\color{orange} \vec u}$ onto ${\color{#3d81f6} \vec v}$.

[Interactive figure: the orthogonal projection of ${\color{orange} \vec u}$ onto the line spanned by ${\color{#3d81f6} \vec v}$.]

As we’ve studied, the resulting error vector,

$${\color{#d81a60} \vec e} = {\color{orange} \vec u} - {\color{#004d40} \vec p}$$

is orthogonal to ${\color{#3d81f6} \vec v}$.

In our original look at the approximation problem, we were approximating ${\color{orange} \vec u}$ using a scalar multiple of just a single vector, ${\color{#3d81f6} \vec v}$. The set of all scalar multiples of ${\color{#3d81f6} \vec v}$, denoted by $\text{span}(\{ {\color{#3d81f6} \vec v} \})$, is a line in $\mathbb{R}^n$.
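If you'd like to verify this numerically, here's a quick sketch using NumPy, with made-up vectors, that computes the projection from the formula above and checks that the error vector really is orthogonal to ${\color{#3d81f6} \vec v}$:

```python
import numpy as np

# Made-up example vectors in R^3; any u and nonzero v would work.
u = np.array([2.0, 7.0, 1.0])
v = np.array([3.0, 1.0, 4.0])

# Orthogonal projection of u onto the line spanned by v:
#   p = (u . v / v . v) v
p = (u @ v) / (v @ v) * v

# The error vector e = u - p should be orthogonal to v.
e = u - p
print(p)      # the closest vector to u among all scalar multiples of v
print(e @ v)  # ~0, up to floating-point roundoff
```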

Key idea: instead of projecting onto the subspace spanned by just a single vector, how might we project onto the subspace spanned by multiple vectors?


Approximating using Multiple Vectors

Equipped with our understanding of linear independence, spans, subspaces, and column spaces, we’re ready to tackle a more advanced version of the approximation problem. I’m going to use slightly different variables to pose the problem than I did in the single vector case, to make the transition back into data, models, and loss functions a little smoother.

In what follows, suppose ${\color{#3d81f6} X}$ is an $n \times d$ matrix whose columns ${\color{#3d81f6} \vec x^{(1)}}, {\color{#3d81f6} \vec x^{(2)}}, \ldots, {\color{#3d81f6} \vec x^{(d)}}$ are the building blocks we want to approximate ${\color{orange} \vec y}$ with. The new approximation problem can be posed in three equivalent ways:

  1. Among all vectors in $\text{span}(\{ {\color{#3d81f6} \vec x^{(1)}}, {\color{#3d81f6} \vec x^{(2)}}, \ldots, {\color{#3d81f6} \vec x^{(d)}} \})$, which one is closest to ${\color{orange} \vec y}$?

  2. Among all vectors in $\text{colsp}({\color{#3d81f6} X})$, which one is closest to ${\color{orange} \vec y}$?

  3. Among all vectors of the form ${\color{#3d81f6} X} \vec w$, which one is closest to ${\color{orange} \vec y}$?

All three statements are asking the exact same question; I've presented all three forms so that you see more clearly how the ideas of spans, column spaces, and matrix-vector multiplication fit together. I will tend to refer to the latter two versions of the problem the most.

First, let's get the trivial case out of the way. If ${\color{orange} \vec y} \in \text{colsp}({\color{#3d81f6} X})$, then the vector in $\text{colsp}({\color{#3d81f6} X})$ that is closest to ${\color{orange} \vec y}$ is just ${\color{orange} \vec y}$ itself. In that case, there exists some $\vec w$ such that ${\color{orange} \vec y} = {\color{#3d81f6} X} \vec w$ exactly. This $\vec w$ is unique only if ${\color{#3d81f6} X}$'s columns are linearly independent; otherwise, there are infinitely many such $\vec w$'s.
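As a quick illustration of this trivial case, here's a sketch with a small made-up matrix whose columns are linearly independent. I'm using NumPy's least-squares routine, `np.linalg.lstsq`, simply as a convenient way to recover the unique $\vec w$:

```python
import numpy as np

# A made-up 4x2 matrix X with linearly independent columns.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])

# Build a y that is, by construction, in colsp(X).
w_true = np.array([5.0, -2.0])
y = X @ w_true

# Since y is in colsp(X) and X's columns are linearly independent,
# there is exactly one w with X w = y; least squares recovers it.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)                      # [ 5. -2.]
print(np.allclose(X @ w, y))  # True: y is its own closest vector, with zero error
```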

But, that's not the case I'm really interested in. I care more about when ${\color{orange} \vec y}$ is not in $\text{colsp}({\color{#3d81f6} X})$. (Remember, this is the case we're interested in when we're doing linear regression: usually, it's not possible to make our predictions 100% correct, and we'll have to settle for some error.)

Then what?

In general, $\text{colsp}({\color{#3d81f6} X})$ is an $r$-dimensional subspace of $\mathbb{R}^n$, where $r = \text{rank}({\color{#3d81f6} X})$. In the diagram below, I've used a plane to represent $\text{colsp}({\color{#3d81f6} X})$; just remember that ${\color{#3d81f6} X}$ may have more than 3 rows or columns.

[Interactive diagram: the plane representing $\text{colsp}({\color{#3d81f6} X})$, with ${\color{orange} \vec y}$ lying outside it.]

We're searching for the vector in $\text{colsp}({\color{#3d81f6} X})$ that is closest to ${\color{orange} \vec y}$, i.e. whose distance from ${\color{orange} \vec y}$ is minimal. Remember that $\text{colsp}({\color{#3d81f6} X})$ is the set of linear combinations of ${\color{#3d81f6} X}$'s columns, so it's the set of all vectors of the form ${\color{#3d81f6} X} \vec w$ for some $\vec w \in \mathbb{R}^d$.
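If that last statement feels abstract, here's a tiny NumPy sketch with made-up numbers showing that ${\color{#3d81f6} X} \vec w$ is just a linear combination of ${\color{#3d81f6} X}$'s columns, and so is always an element of $\text{colsp}({\color{#3d81f6} X})$:

```python
import numpy as np

# A made-up 3x2 matrix and a made-up weight vector.
X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [4.0, 3.0]])
w = np.array([2.0, -1.0])

# X w is w_1 * (first column) + w_2 * (second column),
# i.e. a linear combination of X's columns -- an element of colsp(X).
print(X @ w)                            # [ 0. -1.  5.]
print(w[0] * X[:, 0] + w[1] * X[:, 1])  # same vector
```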

So, our problem boils down to finding the $\vec w$ that minimizes the distance between ${\color{orange} \vec y}$ and ${\color{#3d81f6} X} \vec w$, i.e. the $\vec w$ that minimizes the norm of the error vector ${\color{#d81a60} \vec e} = {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w$:

$$\lVert {\color{#d81a60} \vec e} \rVert = \lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \rVert$$

$f(\vec w) = \lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \rVert$ is a function of $\vec w$ only; ${\color{#3d81f6} X}$ and ${\color{orange} \vec y}$ should be thought of as fixed. There are two ways we'll minimize $f(\vec w)$:

  1. Using a geometric argument involving orthogonality, as we did in the single vector case.

  2. Using calculus. This is more involved than before, since the input variable is a vector, not a scalar, but it can be done, as we’ll see in Chapter 4.

What does our intuition tell us? Extending the single vector case, we expect the vector in $\text{colsp}({\color{#3d81f6} X})$ that is closest to ${\color{orange} \vec y}$ to be the orthogonal projection of ${\color{orange} \vec y}$ onto $\text{colsp}({\color{#3d81f6} X})$.
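We'll justify this claim, and derive a formula for the best $\vec w$, shortly. In the meantime, here's a numerical preview: the sketch below uses NumPy's `np.linalg.lstsq`, which minimizes $\lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \rVert$ directly, on some made-up data. Notice that the resulting error vector comes out orthogonal to every column of ${\color{#3d81f6} X}$, just as the single vector case would suggest.

```python
import numpy as np

# Made-up data: X's columns span a plane in R^4, and y lies outside that plane.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Find the w that minimizes ||y - X w|| (this is exactly what lstsq does).
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# The projection of y onto colsp(X), and the resulting error vector.
p = X @ w_star
e = y - p

# The error is orthogonal to every column of X, hence to all of colsp(X).
print(X.T @ e)   # ~[0. 0.], up to floating-point roundoff
```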

More to come!