
2.3. Projection, Part 1

The Approximation Problem

Suppose u{\color{orange}\vec u} and v{\color{#3d81f6}\vec v} are any two vectors in Rn\mathbb{R}^n. (When I say this, I just mean that they both have the same number of components.)

Let’s think about all possible vectors of the form kv\color{#004d40} k\vec v, where kk can be any scalar. Any vector of the form kv\color{#004d40} k\vec v is a scalar multiple of v{\color{#3d81f6}\vec v}, and points in the same direction as v{\color{#3d81f6}\vec v} (if k>0k > 0) or the opposite direction (if k<0k < 0). What is different about these scalar multiples is how long they are.

To get a sense of what I mean by this, play with the slider for kk below. There are three vectors being visualized: some u\color{orange}\vec u, some v\color{#3d81f6}\vec v, and kv\color{#004d40} k\vec v, which depends on the value of kk you choose.

(Interactive visualization: a slider for k, showing the vectors u, v, and kv.)

Notice that the set of vectors of the form kv\color{#004d40} k\vec v fills out a line. So really, the question we’re asking is: which vector on this line is closest to u\color{orange}\vec u?

In terms of angles, if θ\theta is the angle between u\color{orange}\vec u and v\color{#3d81f6}\vec v, then the angle between u\color{orange}\vec u and kv\color{#004d40} k\vec v is either θ\theta (if k>0k > 0) or 180θ180^{\circ} - \theta (if k<0k < 0). So changing kk doesn’t change how “similar” u\color{orange}\vec u and kv\color{#004d40} k\vec v are in the cosine similarity sense.
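If you want to check this numerically, here’s a quick sketch. The specific vectors and the cosine helper are arbitrary choices of mine, just for illustration:

```python
# Quick numerical check: scaling v by k doesn't change the cosine
# similarity with u, except for a sign flip when k < 0.
# The vectors u and v here are arbitrary examples.
import numpy as np

def cosine_similarity(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

u = np.array([3.0, 5.0])
v = np.array([6.0, -2.0])

for k in [0.5, 2.0, 10.0, -1.0, -4.0]:
    print(k, round(float(cosine_similarity(u, k * v)), 4))
```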

But, some choices of kk will make kv\color{#004d40} k\vec v closer to u\color{orange}\vec u than others. I call this the approximation problem: how well can we recreate, or approximate, u\color{orange}\vec u using a scalar multiple of v\color{#3d81f6}\vec v? It turns out that linear regression is intimately related to this problem. Previously, we were trying to approximate commute times as best as we could using a linear function of departure times.

Let’s be more precise about what we mean by “closer”. For any value of kk, we can measure the error of our approximation by the length of the error vector, e=ukv{\color{#d81b60} \vec e } = {\color{orange}\vec u} - \color{#004d40} k\vec v.

(Interactive visualization: the same slider for k, now also showing the error vector e = u − kv.)

Continue to play with the slider above for kk. How do you get the length of the error vector to be as small as possible?

Intuitively, it seems that to get the error vector e\color{#d81b60} \vec e to be as short as possible, we should make it orthogonal to v\color{#3d81f6}\vec v. Since we can control kk, we can control ukv{\color{orange}\vec u} - \color{#004d40} k\vec v, so we can make the error vector orthogonal to v\color{#3d81f6}\vec v by choosing the right kk.


Orthogonal Projections

Our goal is to minimize the length of the error vector,

\lVert {\color{#d81b60}\vec e} \rVert

This is the same as minimizing

\lVert {\color{orange}\vec u} - {\color{#004d40} k\vec v} \rVert

One way to approach this problem is to treat the above expression as a function of kk and find the value of kk that minimizes it through calculus. You’ll do this in Homework 3. I’ll show you a more geometric approach here.
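Before we do any geometry, here’s a brute-force numerical look at this function of k: the sketch below evaluates the length of the error vector on a fine grid of k values, for one example choice of u and v that I made up, and reports where the minimum occurs.

```python
# Brute force: evaluate ||u - k v|| on a fine grid of k values and find
# the grid point where it's smallest. (Just for intuition, not a proof.)
import numpy as np

u = np.array([2.0, 1.0, 3.0])   # example vectors of my own choosing
v = np.array([1.0, 1.0, 1.0])

ks = np.linspace(-5, 5, 10001)                       # candidate values of k
errors = np.array([np.linalg.norm(u - k * v) for k in ks])

k_best = ks[np.argmin(errors)]
print(k_best, errors.min())     # the best k on the grid, and its error length
```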

We’ve guessed, but not yet shown that the shortest possible error vector is the one that is orthogonal to v\color{#3d81f6}\vec v. Let kok_o be the value of kk that makes the error vector eo=ukov{\color{#d81b60}\vec e_o} = {\color{orange}\vec u} - \color{#004d40} k_o\vec v orthogonal to v\color{#3d81f6}\vec v. Here, we’ll prove that kok_o is the “best” choice of kk by showing that any other choice of kk will result in an error vector that is longer than eo\color{#d81b60}\vec e_o. Think of this as a proof by contradiction (if you’re familiar with that idea; no worries if not).

For comparison, let kk' be some other value of kk, and let its error vector be e=ukv\vec e' = {\color{orange}\vec u} - k' \color{#3d81f6}\vec v.

(Figure: the vectors k_o v and k' v drawn along v, together with their error vectors; the right triangle they form is shaded in gold.)

I’ve drawn kvk' \vec v in gray. Arbitrarily, I’ve shown it as being shorter than kov\color{#004d40} k_o \vec v, but I could have drawn it as being longer and the argument would be the same. The prime has nothing to do with derivatives, by the way – it’s just a new variable.

The vectors kvk' \vec v and kov\color{#004d40} k_o \vec v, along with their corresponding error vectors, create a right-angled triangle, shaded in gold above. This triangle has a hypotenuse of e\vec e' and legs of eo\color{#d81b60} \vec e_o and kvkovk' \vec v - \color{#004d40} k_o \vec v.

Applying the Pythagorean theorem to this triangle gives us

\lVert \vec e' \rVert^2 = \lVert {\color{#d81b60}\vec e_o} \rVert^2 + \underbrace{\lVert k' \vec v - {\color{#004d40} k_o \vec v} \rVert^2}_{> 0}

\lVert k' \vec v - {\color{#004d40} k_o \vec v} \rVert^2 \geq 0, with equality only when we choose k' = k_o, but we’ve assumed that k' \neq k_o.

This implies that

\lVert \vec e' \rVert^2 = \lVert {\color{#d81b60}\vec e_o} \rVert^2 + \text{some positive number}

which means that e2>eo2\lVert \vec e' \rVert^2 > \lVert {\color{#d81b60}\vec e_o} \rVert^2.

In other words, if kkok' \neq k_o, then e>eo\lVert \vec e' \rVert > \lVert {\color{#d81b60}\vec e_o} \rVert. Thus, the error vector eo\color{#d81b60}\vec e_o is shorter than any other error vector e\vec e', and the best choice of kk is kok_o!
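If you’d like to sanity-check this argument numerically, here’s a short sketch. The vectors u and v are arbitrary choices of mine; for each candidate k′, it verifies both the Pythagorean identity above and that the resulting error vector is longer than the orthogonal one.

```python
# Numerical check of the argument above, with arbitrary example vectors.
import numpy as np

u = np.array([3.0, 5.0])
v = np.array([6.0, -2.0])

k_o = (u @ v) / (v @ v)     # the k that makes the error vector orthogonal to v
e_o = u - k_o * v

for k_prime in [-2.0, 0.0, 0.5, 3.0]:       # any values other than k_o
    e_prime = u - k_prime * v
    lhs = np.linalg.norm(e_prime) ** 2
    rhs = np.linalg.norm(e_o) ** 2 + np.linalg.norm(k_prime * v - k_o * v) ** 2
    print(np.isclose(lhs, rhs),                               # Pythagorean identity
          np.linalg.norm(e_prime) > np.linalg.norm(e_o))      # e' is longer than e_o
```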

What was the point of all this again?

We now know that the answer to the approximation problem is kv\color{#004d40} k^* \vec v, where kk^* is the value of kk that makes the error vector e=ukv{\color{#d81b60}\vec e} = {\color{orange}\vec u} - \color{#004d40} k^* \vec v orthogonal to v\color{#3d81f6}\vec v. (I’ve switched from calling this optimal scalar kok_o to kk^*; kok_o was a name I used in the proof above, but more generally, “optimal” values are starred for our purposes).

Let’s now find the value of kk^*, in terms of just u\color{orange} \vec u and v\color{#3d81f6} \vec v. If e\color{#d81b60}\vec e is orthogonal to v\color{#3d81f6}\vec v, then ve=0{\color{#3d81f6}\vec v} \cdot {\color{#d81b60}\vec e} = 0.

\begin{align*} {\color{#3d81f6}\vec v} \cdot {\color{#d81b60}\vec e} &= 0 \\ {\color{#3d81f6}\vec v} \cdot ({\color{orange}\vec u} - {\color{#004d40} k^* \vec v}) &= 0 \\ {\color{#3d81f6}\vec v} \cdot {\color{orange}\vec u} - {\color{#3d81f6}\vec v} \cdot {\color{#004d40} k^* \vec v} &= 0 \\ {\color{#3d81f6}\vec v} \cdot {\color{orange}\vec u} - {\color{#004d40} k^*} ({\color{#3d81f6}\vec v} \cdot {\color{#3d81f6} \vec v}) &= 0 \\ {\color{#3d81f6}\vec v} \cdot {\color{orange}\vec u} &= {\color{#004d40} k^*} ({\color{#3d81f6}\vec v} \cdot {\color{#3d81f6} \vec v}) \\ {\color{#004d40} k^*} &= \boxed{\frac{{\color{orange}\vec u} \cdot {\color{#3d81f6}\vec v}}{{\color{#3d81f6}\vec v} \cdot {\color{#3d81f6}\vec v}}} \end{align*}

The boxed value above is a scalar. It tells us the optimal amount to multiply v\color{#3d81f6}\vec v by to get the best approximation of u\color{orange}\vec u. Once we multiply that boxed scalar, kk^*, by the vector v\color{#3d81f6}\vec v, we get what’s called the orthogonal projection of u\color{orange}\vec u onto v\color{#3d81f6}\vec v:

\vec p = {\color{#004d40} k^* \vec v} = \frac{{\color{orange}\vec u} \cdot {\color{#3d81f6}\vec v}}{{\color{#3d81f6}\vec v} \cdot {\color{#3d81f6}\vec v}} {\color{#3d81f6}\vec v}

Among all vectors of the form kvk\color{#3d81f6}\vec v, p\vec p is the one that is closest to u\color{orange}\vec u.

Why “orthogonal projection”? “Orthogonal” comes from the fact that p\vec p’s error vector e=up{\color{#d81b60}\vec e} = {\color{orange}\vec u} - \vec p is orthogonal to v\color{#3d81f6}\vec v. “Projection” comes from the intuition you should have that p\vec p is the shadow of u\color{orange}\vec u onto v\color{#3d81f6}\vec v.
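In code, the boxed formula is one line. Here’s a minimal NumPy sketch; the function name project and the example vectors are my own choices, not notation we’ll use elsewhere.

```python
# A minimal sketch of the boxed formula. The name `project` and the
# example vectors are my own choices.
import numpy as np

def project(u, v):
    """Orthogonal projection of u onto v: ((u . v) / (v . v)) v."""
    k_star = (u @ v) / (v @ v)
    return k_star * v

u = np.array([2.0, 1.0, 3.0])
v = np.array([1.0, 1.0, 1.0])

p = project(u, v)                # here k* = 6/3 = 2, so p = [2, 2, 2]
e = u - p                        # the error vector

print(p)
print(np.isclose(e @ v, 0))      # True: e is orthogonal to v
```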

(Interactive visualization: u, v, the projection p, and the error vector e = u − p.)

We’ve defined the error vector as e=up{\color{#d81b60}\vec e} = {\color{orange}\vec u} - \vec p, and we know that e{\color{#d81b60}\vec e} is orthogonal to v\color{#3d81f6}\vec v. Rearranging the definition of the error vector gives us

{\color{orange} \vec u} = \underbrace{\vec p}_\text{parallel to $\color{#3d81f6}\vec v$} + \underbrace{{\color{#d81b60}\vec e}}_\text{orthogonal to $\color{#3d81f6}\vec v$}

All this says is that u\color{orange}\vec u is the sum of:

  • p\color{#004d40}\vec p, which is parallel to v\color{#3d81f6}\vec v (by definition of orthogonal projection)

  • e\color{#d81b60}\vec e, which is orthogonal to v\color{#3d81f6}\vec v (by definition of error vector)

Sometimes, we call this the orthogonal decomposition of u\color{orange}\vec u with respect to v\color{#3d81f6}\vec v. I’ll speak more about decompositions later in this section.


Examples

Let’s make things concrete by working through several examples. Each one was carefully chosen to illustrate something in particular.

We will work through several of these examples in lecture; attempt the ones that we don’t on your own.

Example: Fundamentals

Let u=[122]\vec u = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} and v=[111]\vec v = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}.

  1. Find the orthogonal projection of u\vec u onto v\vec v.

  2. Find the error vector, i.e. the vector e=up\vec e = \vec u - \vec p, and verify that it is orthogonal to v\vec v.

  3. What is the length of the error vector (i.e. the projection error)?


Example: The Line Perspective

Consider the points p1=(1,2,2)p_1 = (1, 2, 2) and p2=(1,1,1)p_2 = (1, 1, 1) in R3\mathbb{R}^3.

What is the shortest distance between p1p_1 and the line that passes through p2p_2 and the origin, (0,0,0)(0, 0, 0)?


Example: Which Order?

Let u=[122]\vec u = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} and v=[111]\vec v = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}.

In the first example, we found the orthogonal projection of u\vec u onto v\vec v.

Now, do the opposite: find the orthogonal projection of v\vec v onto u\vec u.


Example: Unit Vectors

Let u=[412]\vec u = \begin{bmatrix} 4 \\ -1 \\ 2 \end{bmatrix}, v=[263]\vec v = \begin{bmatrix} 2 \\ 6 \\ -3 \end{bmatrix} and V=[2/76/73/7]\vec V = \begin{bmatrix} 2 / 7 \\ 6 / 7 \\ -3 / 7 \end{bmatrix}.

  1. Find the orthogonal projection of u\vec u onto v\vec v.

  2. Find the orthogonal projection of u\vec u onto V\vec V.

  3. What do you notice about your answers to the above two parts?


Example: Unit Vectors, Continued

Suppose u,vRn\vec u, \vec v \in \mathbb{R}^n. Let θ\theta be the angle between u\vec u and v\vec v.

Show that the orthogonal projection of u\vec u onto v\vec v is equal to

\vec p = \left( \lVert \vec u \rVert \cos \theta \right) \frac{\vec v}{\lVert \vec v \rVert}

This is not a formula we’d use to actually compute p\vec p, since finding cosθ\cos \theta is harder than using the dot product-based formula from above. But, what does this formula tell us about the relationship between u\vec u and p\vec p?
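Once you’ve shown this algebraically, you can also convince yourself numerically that the two formulas agree. Here’s a quick sketch with an arbitrary pair of vectors of my own choosing; the check isn’t a substitute for the proof.

```python
# Check that the ||u|| cos(theta) formula matches the dot-product formula,
# for one arbitrary pair of vectors (my own choice).
import numpy as np

u = np.array([2.0, 1.0, 3.0])
v = np.array([3.0, 0.0, 4.0])

p_dot = (u @ v) / (v @ v) * v     # the formula from earlier in this section

cos_theta = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
p_cos = (np.linalg.norm(u) * cos_theta) * (v / np.linalg.norm(v))

print(np.allclose(p_dot, p_cos))  # True
```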


Example: Projecting onto an Orthogonal Vector

Let u=[412]\vec u = \begin{bmatrix} 4 \\ -1 \\ 2 \end{bmatrix} and v=[306]\vec v = \begin{bmatrix} -3 \\ 0 \\ 6 \end{bmatrix}.

u\vec u and v\vec v are orthogonal. What does this say about the orthogonal projection of u\vec u onto v\vec v?


Orthogonal Decomposition

To motivate the next section, let’s consider another example.

Let u=[35]\color{orange} \vec u = \begin{bmatrix} 3 \\ 5 \end{bmatrix}, v1=[62]\color{#3d81f6} \vec v_1 = \begin{bmatrix} 6 \\ -2 \end{bmatrix}, and v2=[13]\color{#3d81f6} \vec v_2 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}.

(Figure: the vectors u, v_1, and v_2.)

Notice that v1\color{#3d81f6}\vec v_1 and v2\color{#3d81f6}\vec v_2 are orthogonal – that’s important to what we’re about to discover.

Let’s find the orthogonal projection of u\color{orange}\vec u onto v1\color{#3d81f6}\vec v_1 (called p1\color{#004d40}\vec p_1) and the orthogonal projection of u\color{orange}\vec u onto v2\color{#3d81f6}\vec v_2 (called p2\color{#004d40}\vec p_2).

{\color{#004d40}\vec p_1} = \frac{{\color{orange}\vec u} \cdot \color{#3d81f6}\vec v_1}{\color{#3d81f6}\vec v_1 \cdot \color{#3d81f6}\vec v_1} {\color{#3d81f6}\vec v_1} = \frac{8}{40} {\color{#3d81f6}\vec v_1} = \frac{1}{5} {\color{#3d81f6}\vec v_1} = {\color{#004d40}\begin{bmatrix} 6 / 5 \\ -2 / 5 \end{bmatrix}}

{\color{#004d40}\vec p_2} = \frac{{\color{orange}\vec u} \cdot \color{#3d81f6}\vec v_2}{\color{#3d81f6}\vec v_2 \cdot \color{#3d81f6}\vec v_2} {\color{#3d81f6}\vec v_2} = \frac{18}{10} {\color{#3d81f6}\vec v_2} = \frac{9}{5} {\color{#3d81f6} \vec v_2} = {\color{#004d40}\begin{bmatrix} 9 / 5 \\ 27 / 5 \end{bmatrix}}

Notice that u\color{orange}\vec u is the sum of p1\color{#004d40}\vec p_1 and p2\color{#004d40}\vec p_2!

{\color{orange}\vec u} = {\color{#004d40}\vec p_1} + {\color{#004d40}\vec p_2} = \underbrace{\frac{1}{5} {\color{#3d81f6}\vec v_1} + \frac{9}{5} {\color{#3d81f6}\vec v_2}}_{\text{linear combination}} = \begin{bmatrix} 6 / 5 \\ -2 / 5 \end{bmatrix} + \begin{bmatrix} 9 / 5 \\ 27 / 5 \end{bmatrix} = \begin{bmatrix} 3 \\ 5 \end{bmatrix}
(Figure: u together with the projections p_1 and p_2, which sum to u.)

Why is the sum of p1\color{#004d40}\vec p_1 and p2\color{#004d40}\vec p_2 equal to u\color{orange}\vec u? Earlier, I mentioned that we can use orthogonal projections to decompose vectors. Here, when we project u\color{orange}\vec u onto v1\color{#3d81f6}\vec v_1, the corresponding error vector e1\color{#d81b60}\vec e_1 is orthogonal to v1\color{#3d81f6}\vec v_1.

{\color{orange} \vec u} = \underbrace{{\color{#004d40}\vec p_1}}_{\text{parallel to } \color{#3d81f6}\vec v_1} + \underbrace{{\color{#d81b60}\vec e_1}}_{\text{orthogonal to } \color{#3d81f6}\vec v_1}

By projecting u\color{orange}\vec u onto v2\color{#3d81f6}\vec v_2, we can recreate the error vector exactly, meaning e1=p2{\color{#d81b60}\vec e_1} = \color{#004d40}\vec p_2.

Taking a step back, the fact that v1\color{#3d81f6} v_1 and v2\color{#3d81f6} v_2 are orthogonal meant that writing u\color{orange}\vec u as a linear combination of v1\color{#3d81f6} v_1 and v2\color{#3d81f6} v_2 was easy.

a_1 {\color{#3d81f6} \underbrace{\begin{bmatrix} 6 \\ -2 \end{bmatrix}}_{\vec v_1}} + a_2 {\color{#3d81f6} \underbrace{\begin{bmatrix} 1 \\ 3 \end{bmatrix}}_{\vec v_2}} = \color{orange} \underbrace{\begin{bmatrix} 3 \\ 5 \end{bmatrix}}_{\vec u}

If v1\color{#3d81f6} v_1 and v2\color{#3d81f6} v_2 were not orthogonal, then writing u\color{orange}\vec u as a linear combination of v1\color{#3d81f6} v_1 and v2\color{#3d81f6} v_2 would have involved solving a system of 2 equations and 2 unknowns, as we’ve had to do in previous sections.

For instance, if we keep [35]\color{orange} \begin{bmatrix} 3 \\ 5 \end{bmatrix} and look at x1=[12]\color{#3d81f6} \vec x_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix} and x2=[21]\color{#3d81f6} \vec x_2 = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, we have that

  • the projection of u\color{orange}\vec u onto x1\color{#3d81f6} \vec x_1 is \left( \frac{{\color{orange}\vec u} \cdot \color{#3d81f6} \vec x_1}{{\color{#3d81f6} \vec x_1} \cdot \color{#3d81f6} \vec x_1} \right) {\color{#3d81f6} \vec x_1} = \frac{13}{5} \color{#3d81f6} \vec x_1

  • the projection of u\color{orange}\vec u onto x2\color{#3d81f6} \vec x_2 is \left( \frac{{\color{orange}\vec u} \cdot \color{#3d81f6} \vec x_2}{{\color{#3d81f6} \vec x_2} \cdot \color{#3d81f6} \vec x_2} \right) {\color{#3d81f6} \vec x_2} = \frac{11}{5} \color{#3d81f6} \vec x_2

but

\frac{13}{5} {\color{#3d81f6} \vec x_1} + \frac{11}{5} {\color{#3d81f6} \vec x_2} = \frac{13}{5} \begin{bmatrix} 1 \\ 2 \end{bmatrix} + \frac{11}{5} \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 13 / 5 + 22 / 5 \\ 26 / 5 + 11 / 5 \end{bmatrix} = \begin{bmatrix} 7 \\ 7.4 \end{bmatrix} \neq {\color{orange}\vec u}
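Here’s the same pair of computations in NumPy, which makes the contrast explicit: with the orthogonal pair, the two projections add back up to u; with the non-orthogonal pair, they don’t. (The project helper just implements the dot-product formula from earlier in this section.)

```python
# The computations above, in NumPy. The `project` helper is the same
# dot-product formula from earlier in this section.
import numpy as np

def project(a, b):
    """Orthogonal projection of a onto b."""
    return (a @ b) / (b @ b) * b

u  = np.array([3.0, 5.0])
v1 = np.array([6.0, -2.0])   # orthogonal pair
v2 = np.array([1.0, 3.0])
x1 = np.array([1.0, 2.0])    # non-orthogonal pair
x2 = np.array([2.0, 1.0])

print(project(u, v1) + project(u, v2))   # [3. 5.]   -> equals u
print(project(u, x1) + project(u, x2))   # [7. 7.4]  -> does not equal u
```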

I’d like to provide a more general “theorem” on when you can use orthogonal projections to more easily write a vector u\color{orange}\vec u as a linear combination of vectors v1\color{#3d81f6} \vec v_1, v2\color{#3d81f6} \vec v_2, ..., vd\color{#3d81f6} \vec v_d, but we’ll first need to study the idea of a basis. That’s to come.

This section has been light on activities, since it provided many examples that we’ll work through in lecture. But, here’s one to tie this last point together.


What’s Next?

The motivating problem for this section was the approximation problem, which asked us to find the best approximation of a vector u\color{orange}\vec u using only a scalar multiple of a vector v\color{#3d81f6}\vec v.

The next natural step is to consider the case where we want to approximate u\color{orange}\vec u using a linear combination of more than one vector, v1\color{#3d81f6} \vec v_1, v2\color{#3d81f6} \vec v_2, ..., vd\color{#3d81f6} \vec v_d. Why? Remember, this all connects back to the problem of linear regression. The more vectors we have as “building blocks” in our linear combination, the more features our model will be able to use. (I haven’t made the connection from linear algebra to linear regression yet, but just know this is why we’re studying projections.)

For example, let’s consider u=[136]\color{orange}\vec u = \begin{bmatrix} 1 \\ 3 \\ 6 \end{bmatrix}, v1=[101]\color{#3d81f6} \vec v_1 = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}, and v2=[221]\color{#3d81f6} \vec v_2 = \begin{bmatrix} 2 \\ 2 \\ 1 \end{bmatrix}. Among all linear combinations of v1\color{#3d81f6} \vec v_1 and v2\color{#3d81f6} \vec v_2, which one is closest to u\color{orange}\vec u?

To answer this question, we’d need to find the scalars a1a_1 and a2a_2 such that the error vector

{\color{#d81b60}\vec e} = {\color{orange}\vec u} - {(a_1 {\color{#3d81f6} \vec v_1} + a_2 {\color{#3d81f6} \vec v_2})}

has minimal length, which presumably happens when e\color{#d81b60}\vec e is orthogonal to both v1\color{#3d81f6} \vec v_1 and v2\color{#3d81f6} \vec v_2.

Travelling down this road, we might be able to find the values of a1a_1 and a2a_2 that minimize the length of e\color{#d81b60}\vec e. But then we’ll want to ask how we can do this for any u\color{orange}\vec u and any set of vectors v1\color{#3d81f6} \vec v_1, v2\color{#3d81f6} \vec v_2, ..., vd\color{#3d81f6} \vec v_d, and it seems like we’ll need a more general solution. In general, to find the “best” approximation of u\color{orange}\vec u using a linear combination of v1\color{#3d81f6} \vec v_1, v2\color{#3d81f6} \vec v_2, ..., vd\color{#3d81f6} \vec v_d, we’ll need to know about matrices. We’ll introduce matrices in Chapter 2.5.
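As a small preview, and not something you’re expected to follow yet: NumPy’s least-squares solver can already find the best a_1 and a_2 for the example above. The column_stack and lstsq calls below lean on the matrix ideas coming in Chapter 2.5.

```python
# Preview only: solve for the best a1, a2 with NumPy's least-squares solver.
import numpy as np

u  = np.array([1.0, 3.0, 6.0])
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([2.0, 2.0, 1.0])

A = np.column_stack([v1, v2])                   # columns are v1 and v2
coeffs, *_ = np.linalg.lstsq(A, u, rcond=None)  # minimizes ||u - A @ coeffs||
a1, a2 = coeffs

e = u - (a1 * v1 + a2 * v2)                     # the resulting error vector
print(a1, a2)                                   # the optimal scalars
print(np.isclose(e @ v1, 0), np.isclose(e @ v2, 0))   # e is orthogonal to both
```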

Before that, though, in Chapter 2.4, we will set aside the goal of projections temporarily and focus on truly understanding the set of possible linear combinations of a given set of vectors. For example, the vectors v1=[100]\color{#3d81f6} \vec v_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, v2=[518]\color{#3d81f6} \vec v_2 = \begin{bmatrix} -5 \\ 1 \\ 8 \end{bmatrix} from earlier define a plane. So, asking which linear combination of v1\color{#3d81f6} \vec v_1 and v2\color{#3d81f6} \vec v_2 is closest to u\color{orange}\vec u is equivalent to asking which point on the plane is closest to u\color{orange}\vec u.

(Interactive visualization: the plane of all linear combinations of v_1 and v_2, and the point on it closest to u.)

Chapter 2.4 will answer the questions, “why do v1\color{#3d81f6} \vec v_1 and v2\color{#3d81f6} \vec v_2 define a plane?”, “which plane do they define?”, and “in general, what do v1\color{#3d81f6} \vec v_1, v2\color{#3d81f6} \vec v_2, ..., vd\color{#3d81f6} \vec v_d, all in Rn\mathbb{R}^n, define?”