The Approximation Problem¶
Suppose $\vec{x}$ and $\vec{y}$ are any two vectors in $\mathbb{R}^n$. (When I say this, I just mean that they both have the same number of components.)
Let’s think about all possible vectors of the form $w\vec{x}$, where $w$ can be any scalar. Any vector of the form $w\vec{x}$ is a scalar multiple of $\vec{x}$, and points in the same direction as $\vec{x}$ (if $w > 0$) or the opposite direction (if $w < 0$). What is different about these scalar multiples is how long they are.
To get a sense of what I mean by this, play with the slider for $w$ below. There are three vectors being visualized: some $\vec{x}$, some $\vec{y}$, and $w\vec{x}$, which depends on the value of $w$ you choose.
Notice that the set of vectors of the form $w\vec{x}$ fills out a line. So really, what we’re asking is which vector on this line is closest to $\vec{y}$.
In terms of angles, if $\theta$ is the angle between $\vec{x}$ and $\vec{y}$, then the angle between $w\vec{x}$ and $\vec{y}$ is either $\theta$ (if $w > 0$) or $180^\circ - \theta$ (if $w < 0$). So changing $w$ doesn’t change how “similar” $w\vec{x}$ and $\vec{y}$ are in the cosine similarity sense, other than possibly flipping the sign of the cosine.
But, some choices of $w$ will make $w\vec{x}$ closer to $\vec{y}$ than others. I call this the approximation problem: how well can we recreate, or approximate, $\vec{y}$ using a scalar multiple of $\vec{x}$? It turns out that linear regression is intimately related to this problem. Previously, we were trying to approximate commute times as best as we could using a linear function of departure times.
Let’s be more precise about what we mean by “closer”. For any value of $w$, we can measure the error of our approximation by the length of the error vector, $\vec{e} = \vec{y} - w\vec{x}$.
Continue to play with the slider above for $w$. How do you get the length of the error vector to be as small as possible?
Intuitively, it seems that to get the error vector to be as short as possible, we should make it orthogonal to $\vec{x}$. Since we can control $w$, we can control $\vec{e} = \vec{y} - w\vec{x}$, so we can make the error vector orthogonal to $\vec{x}$ by choosing the right $w$.
Orthogonal Projections¶
Our goal is to minimize the length of the error vector, $\vec{e} = \vec{y} - w\vec{x}$.
This is the same as minimizing

$$\lVert \vec{e} \rVert = \lVert \vec{y} - w\vec{x} \rVert$$

One way to approach this problem is to treat the above expression as a function of $w$ and find the value of $w$ that minimizes it through calculus. You’ll do this in Homework 3. I’ll show you a more geometric approach here.
We’ve guessed, but not yet shown, that the shortest possible error vector is the one that is orthogonal to $\vec{x}$. Let $w$ be the value of the scalar that makes the error vector $\vec{e} = \vec{y} - w\vec{x}$ orthogonal to $\vec{x}$. Here, we’ll prove that $w$ is the “best” choice of scalar by showing that any other choice of the scalar will result in an error vector that is longer than $\vec{e}$. Think of this as a proof by contradiction (if you’re familiar with that idea; no worries if not).
For comparison, let $w'$ be some other value of the scalar, with $w' \neq w$, and let its error vector be $\vec{e}\,' = \vec{y} - w'\vec{x}$.
I’ve drawn $w'\vec{x}$ in gray. Arbitrarily, I’ve shown it as being shorter than $w\vec{x}$, but I could have drawn it as being longer and the argument would be the same. The prime has nothing to do with derivatives, by the way – it’s just a new variable.
The vectors $w\vec{x}$ and $w'\vec{x}$, along with their corresponding error vectors, create a right-angled triangle, shaded in gold above. This triangle has a hypotenuse of $\vec{e}\,'$ and legs of $\vec{e}$ and $(w - w')\vec{x}$.
Applying the Pythagorean theorem to this triangle gives us

$$\lVert \vec{e}\,' \rVert^2 = \lVert \vec{e} \rVert^2 + \lVert (w - w')\vec{x} \rVert^2 \geq \lVert \vec{e} \rVert^2$$

with equality only when we choose $w' = w$, but we’ve assumed that $w' \neq w$.
This implies that

$$\lVert \vec{e}\,' \rVert^2 > \lVert \vec{e} \rVert^2$$

which means that $\lVert \vec{e}\,' \rVert > \lVert \vec{e} \rVert$.
In other words, if $w' \neq w$, then $\lVert \vec{e}\,' \rVert > \lVert \vec{e} \rVert$. Thus, the error vector $\vec{e}$ is shorter than any other error vector $\vec{e}\,'$, and the best choice of scalar is the $w$ that makes the error vector orthogonal to $\vec{x}$!
What was the point of all this again?
We know that the answer is $w^*\vec{x}$, where $w^*$ is the value of $w$ that makes the error vector orthogonal to $\vec{x}$. (I’ve switched from calling this optimal scalar $w$ to $w^*$; $w$ was a name I used in the proof above, but more generally, “optimal” values are starred for our purposes).
Let’s now find the value of $w^*$, in terms of just $\vec{x}$ and $\vec{y}$. If $\vec{e} = \vec{y} - w^*\vec{x}$ is orthogonal to $\vec{x}$, then

$$\vec{x} \cdot (\vec{y} - w^*\vec{x}) = 0 \implies \vec{x} \cdot \vec{y} - w^*(\vec{x} \cdot \vec{x}) = 0 \implies \boxed{w^* = \frac{\vec{x} \cdot \vec{y}}{\vec{x} \cdot \vec{x}}}$$

The boxed value above is a scalar. It tells us the optimal amount to multiply $\vec{x}$ by to get the best approximation of $\vec{y}$. Once we multiply that boxed scalar, $w^*$, by the vector $\vec{x}$, we get what’s called the orthogonal projection of $\vec{y}$ onto $\vec{x}$:

$$w^*\vec{x} = \left( \frac{\vec{x} \cdot \vec{y}}{\vec{x} \cdot \vec{x}} \right) \vec{x}$$

Among all vectors of the form $w\vec{x}$, the vector $w^*\vec{x}$ above is the one that is closest to $\vec{y}$.
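If you’d like a numerical sanity check of this claim, here’s a minimal sketch in Python using numpy. The specific vectors (and the helper function) are made up for illustration – any two vectors with the same number of components would work.

```python
import numpy as np

# Hypothetical vectors, chosen only for illustration.
x = np.array([3.0, 1.0])
y = np.array([2.0, 4.0])

w_star = np.dot(x, y) / np.dot(x, x)   # the boxed optimal scalar

def error_length(w):
    """Length of the error vector y - w x."""
    return np.linalg.norm(y - w * x)

# Any other choice of w gives a longer error vector than w_star does.
for w in [w_star - 1, w_star - 0.1, w_star + 0.1, w_star + 1]:
    print(error_length(w) > error_length(w_star))   # True every time
```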
Why “orthogonal projection”? “Orthogonal” comes from the fact that $w^*\vec{x}$’s error vector is orthogonal to $\vec{x}$. “Projection” comes from the intuition you should have that $w^*\vec{x}$ is the shadow of $\vec{y}$ onto $\vec{x}$.
We’ve defined the error vector as $\vec{e} = \vec{y} - w^*\vec{x}$, and we know that $\vec{e}$ is orthogonal to $\vec{x}$. Rearranging the definition of the error vector gives us

$$\vec{y} = w^*\vec{x} + \vec{e}$$

All this says is that $\vec{y}$ is the sum of:
$w^*\vec{x}$, which is parallel to $\vec{x}$ (by definition of orthogonal projection)
$\vec{e}$, which is orthogonal to $\vec{x}$ (by definition of error vector)
Sometimes, we call this the orthogonal decomposition of $\vec{y}$ with respect to $\vec{x}$. I’ll speak more about decompositions later in this section.
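Here’s another short sketch, again in Python with numpy and with made-up vectors, that computes the orthogonal projection and verifies this decomposition.

```python
import numpy as np

def project(y, x):
    """Orthogonal projection of y onto x, i.e. the vector w* x."""
    w_star = np.dot(x, y) / np.dot(x, x)
    return w_star * x

# Hypothetical vectors, for illustration only.
x = np.array([3.0, 1.0])
y = np.array([2.0, 4.0])

p = project(y, x)             # the orthogonal projection of y onto x
e = y - p                     # the error vector

print(np.dot(e, x))           # ~0: e is orthogonal to x
print(np.allclose(p + e, y))  # True: y = (parallel part) + (orthogonal part)
```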
Examples¶
Let’s make things concrete by working through several examples. Each one was carefully chosen to illustrate something in particular.
We will work through several of these examples in lecture; attempt the ones that we don’t on your own.
Example: Fundamentals¶
Let $\vec{u}$ and $\vec{v}$ be two vectors with the same number of components.
Find the orthogonal projection of $\vec{v}$ onto $\vec{u}$.
Find the error vector, i.e. the vector $\vec{e} = \vec{v} - w^*\vec{u}$, and verify that it is orthogonal to $\vec{u}$.
What is the length of the error vector (i.e. the projection error)?
Solution
Part 1
The orthogonal projection of $\vec{v}$ onto $\vec{u}$ is given by

$$\left( \frac{\vec{u} \cdot \vec{v}}{\vec{u} \cdot \vec{u}} \right) \vec{u}$$

Computing the two dot products, $\vec{u} \cdot \vec{v}$ and $\vec{u} \cdot \vec{u}$, and substituting them into this formula gives the orthogonal projection of $\vec{v}$ onto $\vec{u}$, which is a scalar multiple of $\vec{u}$.
Part 2
The error vector is $\vec{e} = \vec{v} - w^*\vec{u}$, where $w^*\vec{u}$ is the projection we found in Part 1.
To check whether it’s orthogonal to $\vec{u}$, we compute the dot product $\vec{u} \cdot \vec{e}$; we’re hoping it’s 0.
So, the error vector is orthogonal to $\vec{u}$.
Part 3
The length of the error vector is $\lVert \vec{e} \rVert = \lVert \vec{v} - w^*\vec{u} \rVert$.
We might say $\lVert \vec{e} \rVert$ is the projection error. Another way of thinking of it is as the shortest distance from the point $\vec{v}$ to the line that passes through $\vec{u}$ and the origin, $\vec{0}$.
Example: The Line Perspective¶
Consider the points $\vec{u}$ and $\vec{v}$ from the previous example.
What is the shortest distance between $\vec{v}$ and the line that passes through $\vec{u}$ and the origin, $\vec{0}$?
Solution
The answer is the projection error we found in Part 3 of the previous example, $\lVert \vec{e} \rVert$. This example didn’t require any additional math beyond the previous example; it just serves to remind you of the geometry of the situation. The set of all possible scalar multiples of $\vec{u}$ fills out a line, and that line passes through $\vec{u}$ and $\vec{0}$.
Why does that line pass through $\vec{0}$? Consider the scalar multiple $0\vec{u}$ – it’s the zero vector!
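As a quick numerical illustration – a sketch with made-up points, not the ones from these examples – the distance from a point to the line through another point and the origin is exactly the length of the projection error.

```python
import numpy as np

def distance_to_line(v, u):
    """Shortest distance from the point v to the line through u and the origin."""
    w_star = np.dot(u, v) / np.dot(u, u)
    e = v - w_star * u          # error vector of projecting v onto u
    return np.linalg.norm(e)    # its length is the point-to-line distance

# Hypothetical points, for illustration only.
u = np.array([3.0, 1.0])
v = np.array([2.0, 4.0])
print(distance_to_line(v, u))
```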
Example: Which Order?¶
Let $\vec{u}$ and $\vec{v}$ be the same vectors as in the first example.
In the first example, we found the orthogonal projection of $\vec{v}$ onto $\vec{u}$.
Now, do the opposite: find the orthogonal projection of $\vec{u}$ onto $\vec{v}$.
Solution
Now, we’re projecting onto $\vec{v}$, which means our answer is going to be a multiple of $\vec{v}$, not of $\vec{u}$ as in the first example.
The orthogonal projection of $\vec{u}$ onto $\vec{v}$ is given by

$$\left( \frac{\vec{u} \cdot \vec{v}}{\vec{v} \cdot \vec{v}} \right) \vec{v}$$

The formula for the scalar in front of $\vec{v}$ is the same as in Part 1 of the first example, but with all $\vec{u}$’s replaced by $\vec{v}$’s and vice versa. The numerator is the same, since $\vec{u} \cdot \vec{v} = \vec{v} \cdot \vec{u}$. The denominator is different; just remember that the denominator is the squared norm of the vector you’re projecting onto.
So, the orthogonal projection of $\vec{u}$ onto $\vec{v}$ is a scalar multiple of $\vec{v}$, and it is generally a different vector than the projection of $\vec{v}$ onto $\vec{u}$.
Note that the corresponding error vector, $\vec{u} - \left( \frac{\vec{u} \cdot \vec{v}}{\vec{v} \cdot \vec{v}} \right) \vec{v}$, is orthogonal to $\vec{v}$ (not $\vec{u}$), since $\vec{v}$ is the vector we projected onto.
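Here’s a small numerical check, with made-up vectors, that the two orders of projection generally give different answers.

```python
import numpy as np

# Hypothetical vectors, for illustration only.
u = np.array([3.0, 1.0])
v = np.array([2.0, 4.0])

proj_v_onto_u = (np.dot(u, v) / np.dot(u, u)) * u   # a multiple of u
proj_u_onto_v = (np.dot(u, v) / np.dot(v, v)) * v   # a multiple of v

print(proj_v_onto_u)   # lies along u
print(proj_u_onto_v)   # lies along v; a different vector in general
```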
Example: Unit Vectors¶
Let $\vec{u}$ and $\vec{v}$ be two vectors, and let $\hat{u} = \frac{\vec{u}}{\lVert \vec{u} \rVert}$ be the unit vector in the direction of $\vec{u}$.
Find the orthogonal projection of $\vec{v}$ onto $\vec{u}$.
Find the orthogonal projection of $\vec{v}$ onto $\hat{u}$.
What do you notice about your answers to the above two parts?
Solution
Part 1
We know that the orthogonal projection of $\vec{v}$ onto $\vec{u}$ is given by

$$\left( \frac{\vec{u} \cdot \vec{v}}{\vec{u} \cdot \vec{u}} \right) \vec{u}$$

Let’s compute the relevant dot products, $\vec{u} \cdot \vec{v}$ and $\vec{u} \cdot \vec{u}$.
Substituting them into the formula above gives the orthogonal projection of $\vec{v}$ onto $\vec{u}$.
Part 2
Now, we need to find the orthogonal projection of $\vec{v}$ onto $\hat{u}$, which is $\left( \frac{\hat{u} \cdot \vec{v}}{\hat{u} \cdot \hat{u}} \right) \hat{u}$.
Let’s compute the relevant dot products, $\hat{u} \cdot \vec{v}$ and $\hat{u} \cdot \hat{u}$.
Substituting them into the formula gives the orthogonal projection of $\vec{v}$ onto $\hat{u}$.
Part 3
Notice that in both parts, the orthogonal projection is the same vector! This is not a coincidence. Both $\vec{u}$ and $\hat{u}$ point in the same direction, meaning the set of possible vectors of the form $w\vec{u}$ is the same as the set of possible vectors of the form $w\hat{u}$. Another way to think about this is that they both span the same line through the origin.
The difference between $\vec{u}$ and $\hat{u}$ is that $\hat{u}$ is a unit vector in the direction of $\vec{u}$, meaning that it points in the same direction as $\vec{u}$ but has length $1$ rather than $\lVert \vec{u} \rVert$.
What’s different is the scalar we need to multiply each vector by to get the orthogonal projection. In the case of the unit vector $\hat{u}$, the number in front of $\hat{u}$ is $\frac{\hat{u} \cdot \vec{v}}{\hat{u} \cdot \hat{u}}$, but since $\hat{u} \cdot \hat{u} = 1$, this simplifies to $\hat{u} \cdot \vec{v}$.
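To see this numerically, here’s a sketch with made-up vectors: projecting onto $\vec{u}$ and onto the unit vector $\hat{u}$ gives the same vector, and the scalar in front of $\hat{u}$ is just a dot product.

```python
import numpy as np

# Hypothetical vectors, for illustration only.
u = np.array([3.0, 4.0])
v = np.array([2.0, 5.0])
u_hat = u / np.linalg.norm(u)                # unit vector in the direction of u

proj_onto_u = (np.dot(u, v) / np.dot(u, u)) * u
proj_onto_u_hat = np.dot(u_hat, v) * u_hat   # denominator u_hat . u_hat is 1

print(np.allclose(proj_onto_u, proj_onto_u_hat))  # True: same projection
print(np.dot(u_hat, v))                           # the scalar in front of u_hat
```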
Example: Unit Vectors, Continued¶
Suppose $\vec{u}$ and $\vec{v}$ are any two vectors. Let $\theta$ be the angle between $\vec{u}$ and $\vec{v}$.
Show that the orthogonal projection of $\vec{v}$ onto $\vec{u}$ is equal to

$$\left( \lVert \vec{v} \rVert \cos \theta \right) \frac{\vec{u}}{\lVert \vec{u} \rVert}$$

This is not a formula we’d use to actually compute the projection, since finding $\theta$ is harder than using the dot product-based formula from above. But, what does this formula tell us about the relationship between $\vec{v}$ and its orthogonal projection onto $\vec{u}$?
Solution
Let’s start with the original formula for the orthogonal projection of $\vec{v}$ onto $\vec{u}$:

$$\left( \frac{\vec{u} \cdot \vec{v}}{\vec{u} \cdot \vec{u}} \right) \vec{u}$$

Using the facts that $\vec{u} \cdot \vec{v} = \lVert \vec{u} \rVert \lVert \vec{v} \rVert \cos \theta$ and $\vec{u} \cdot \vec{u} = \lVert \vec{u} \rVert^2$, we can rewrite the formula as

$$\frac{\lVert \vec{u} \rVert \lVert \vec{v} \rVert \cos \theta}{\lVert \vec{u} \rVert^2} \vec{u} = \left( \lVert \vec{v} \rVert \cos \theta \right) \frac{\vec{u}}{\lVert \vec{u} \rVert}$$

The parentheses around $\lVert \vec{v} \rVert \cos \theta$ don’t change the calculation, but they help with the interpretation. This shows us that we can think of the orthogonal projection of $\vec{v}$ onto $\vec{u}$ as a vector with:
a (signed) length of $\lVert \vec{v} \rVert \cos \theta$
in the direction of $\frac{\vec{u}}{\lVert \vec{u} \rVert}$, which is a unit vector in the direction of $\vec{u}$
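Here’s a numerical check of this rewriting – a sketch with made-up vectors – computing the cosine of the angle explicitly and comparing the two forms of the projection.

```python
import numpy as np

# Hypothetical vectors, for illustration only.
u = np.array([3.0, 1.0])
v = np.array([2.0, 4.0])

# Dot product form of the projection.
proj_dot = (np.dot(u, v) / np.dot(u, u)) * u

# Angle-based form: (||v|| cos(theta)) times the unit vector along u.
cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
proj_angle = (np.linalg.norm(v) * cos_theta) * (u / np.linalg.norm(u))

print(np.allclose(proj_dot, proj_angle))   # True: the two forms agree
```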
Example: Projecting onto an Orthogonal Vector¶
Let $\vec{u}$ and $\vec{v}$ be two vectors in $\mathbb{R}^2$.
$\vec{u}$ and $\vec{v}$ are orthogonal, meaning $\vec{u} \cdot \vec{v} = 0$. What does this say about the orthogonal projection of $\vec{v}$ onto $\vec{u}$?
Solution
Since $\vec{u} \cdot \vec{v} = 0$, the optimal scalar is $w^* = \frac{\vec{u} \cdot \vec{v}}{\vec{u} \cdot \vec{u}} = 0$, so the orthogonal projection of $\vec{v}$ onto $\vec{u}$ is the zero vector, $\vec{0}$.
Intuitively, $\vec{u}$ and $\vec{v}$ travel in totally different directions. Travelling any amount in the direction of $\vec{u}$ will only take you further away from $\vec{v}$. So, it’s best to stick with the zero vector, $\vec{0}$.
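Numerically, with a made-up orthogonal pair of vectors:

```python
import numpy as np

# Hypothetical orthogonal vectors, for illustration only.
u = np.array([1.0, 2.0])
v = np.array([-2.0, 1.0])              # u . v = 0

w_star = np.dot(u, v) / np.dot(u, u)   # numerator is 0, so w* = 0
print(w_star * u)                      # [0. 0.]: the projection is the zero vector
```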

Notice that $\vec{u}$ and $\vec{v}$ are orthogonal – that’s important to what we’re about to discover.
Now, take some other vector $\vec{c}$ in $\mathbb{R}^2$. Let’s find the orthogonal projection of $\vec{c}$ onto $\vec{u}$ (called $\vec{p}_u$) and the orthogonal projection of $\vec{c}$ onto $\vec{v}$ (called $\vec{p}_v$).
Notice that $\vec{c}$ is the sum of $\vec{p}_u$ and $\vec{p}_v$!

Why is the sum of $\vec{p}_u$ and $\vec{p}_v$ equal to $\vec{c}$? Earlier, I mentioned that we can use orthogonal projections to decompose vectors. Here, when we project $\vec{c}$ onto $\vec{u}$, the corresponding error vector $\vec{e} = \vec{c} - \vec{p}_u$ is orthogonal to $\vec{u}$.
By projecting $\vec{c}$ onto $\vec{v}$ (which is parallel to that error vector, since both are orthogonal to $\vec{u}$ and we’re in $\mathbb{R}^2$), we can recreate the error vector exactly, meaning $\vec{p}_v = \vec{e}$ and so $\vec{c} = \vec{p}_u + \vec{p}_v$.
Taking a step back, the fact that $\vec{u}$ and $\vec{v}$ are orthogonal meant that writing $\vec{c}$ as a linear combination of $\vec{u}$ and $\vec{v}$ was easy.
If $\vec{u}$ and $\vec{v}$ were not orthogonal, then writing $\vec{c}$ as a linear combination of $\vec{u}$ and $\vec{v}$ would have involved solving a system of 2 equations and 2 unknowns, as we’ve had to do in previous sections.
For instance, if we keep $\vec{c}$ but instead look at two vectors $\vec{s}$ and $\vec{t}$ that are not orthogonal to each other, we have that
the projection of $\vec{c}$ onto $\vec{s}$ is some scalar multiple of $\vec{s}$,
the projection of $\vec{c}$ onto $\vec{t}$ is some scalar multiple of $\vec{t}$,
but the sum of those two projections is not equal to $\vec{c}$, as the sketch below illustrates.
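Here’s that contrast in Python with numpy; the vector being decomposed, the orthogonal pair, and the non-orthogonal pair are all made up for illustration.

```python
import numpy as np

def project(c, d):
    """Orthogonal projection of c onto d."""
    return (np.dot(d, c) / np.dot(d, d)) * d

c = np.array([5.0, 2.0])          # hypothetical vector to decompose

# Orthogonal pair: the two projections sum back to c.
u = np.array([1.0, 1.0])
v = np.array([1.0, -1.0])         # u . v = 0
print(np.allclose(project(c, u) + project(c, v), c))   # True

# Non-orthogonal pair: the two projections do NOT sum back to c.
s = np.array([1.0, 0.0])
t = np.array([1.0, 1.0])          # s . t != 0
print(np.allclose(project(c, s) + project(c, t), c))   # False
```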
I’d like to provide a more general “theorem” on when you can use orthogonal projections to more easily write a vector as a linear combination of vectors $\vec{v}_1$, $\vec{v}_2$, ..., $\vec{v}_d$, but we’ll need to first study the idea of a basis. That’s to come.
This section has been light on activities, since it provided many examples that we’ll work through in lecture. But, here’s one to tie this last point together.
What’s Next?¶
The motivating problem for this section was the approximation problem, which asked us to find the best approximation of a vector $\vec{y}$ using only a scalar multiple of a vector $\vec{x}$.
The next natural step is to consider the case where we want to approximate $\vec{y}$ using a linear combination of more than one vector, $\vec{x}_1$, $\vec{x}_2$, ..., $\vec{x}_d$. Why? Remember, this all connects back to the problem of linear regression. The more vectors we have as “building blocks” in our linear combination, the more features our model will be able to use. (I haven’t made the connection from linear algebra to linear regression yet, but just know this is why we’re studying projections.)
For example, let’s consider two vectors $\vec{x}_1$ and $\vec{x}_2$, along with a vector $\vec{y}$ we’d like to approximate. Among all linear combinations of $\vec{x}_1$ and $\vec{x}_2$, which one is closest to $\vec{y}$?
To answer this question, we’d need to find the scalars $w_1$ and $w_2$ such that the error vector

$$\vec{e} = \vec{y} - (w_1 \vec{x}_1 + w_2 \vec{x}_2)$$

has minimal length, which presumably happens when $\vec{e}$ is orthogonal to both $\vec{x}_1$ and $\vec{x}_2$.
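To get a feel for where this is headed, here’s a minimal sketch with made-up vectors that finds $w_1$ and $w_2$ by imposing exactly those two orthogonality conditions, which amounts to solving a system of 2 equations in 2 unknowns.

```python
import numpy as np

# Hypothetical vectors in R^3, for illustration only.
x1 = np.array([1.0, 0.0, 1.0])
x2 = np.array([0.0, 1.0, 1.0])
y = np.array([2.0, 3.0, 4.0])

# Requiring e = y - (w1 x1 + w2 x2) to be orthogonal to x1 and to x2
# gives two equations in the two unknowns w1 and w2.
A = np.array([[np.dot(x1, x1), np.dot(x1, x2)],
              [np.dot(x2, x1), np.dot(x2, x2)]])
b = np.array([np.dot(x1, y), np.dot(x2, y)])
w1, w2 = np.linalg.solve(A, b)

e = y - (w1 * x1 + w2 * x2)
print(w1, w2)
print(np.dot(e, x1), np.dot(e, x2))   # both ~0: e is orthogonal to x1 and x2
```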
Travelling down this road, we might be able to find the values of $w_1$ and $w_2$ that minimize the length of $\vec{e}$. But then we’ll want to ask how we can do this for any $\vec{y}$ and any set of vectors $\vec{x}_1$, $\vec{x}_2$, ..., $\vec{x}_d$, and it seems like we’ll need a more general solution. In general, to find the “best” approximation of $\vec{y}$ using a linear combination of $\vec{x}_1$, $\vec{x}_2$, ..., $\vec{x}_d$, we’ll need to know about matrices. We’ll introduce matrices in Chapter 2.5.
For now, in Chapter 2.4, we will set aside the goal of projections temporarily, and instead focus on truly understanding the set of possible linear combinations of a given set of vectors. For example, the vectors $\vec{x}_1$ and $\vec{x}_2$ from earlier define a plane. So, asking which linear combination of $\vec{x}_1$ and $\vec{x}_2$ is closest to $\vec{y}$ is equivalent to asking which point on that plane is closest to $\vec{y}$.
Chapter 2.4 will answer the questions, “why do $\vec{x}_1$ and $\vec{x}_2$ define a plane?”, “which plane do they define?”, and “in general, what do $\vec{x}_1$, $\vec{x}_2$, ..., $\vec{x}_d$, all in $\mathbb{R}^n$, define?”