
10.3. The Best Direction

Introduction

As we saw in Chapter 10.2, the singular value decomposition of a matrix $X$ into $X = U \Sigma V^T$ can be used to find the best low-rank approximation of $X$, which has applications in image compression, for example. But, hiding in the singular value decomposition is another, more data-driven application: dimensionality reduction.

Consider, for example, the dataset below. Each row corresponds to a penguin.

(Table: preview of the penguins dataset.)

Each penguin has three categorical features - species, island, and sex - and four numerical features - bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. If we restrict our attention to these numerical features, we have a $333 \times 4$ matrix, which we can think of as 333 points in $\mathbb{R}^4$. We, of course, can't visualize in $\mathbb{R}^4$. One solution is to draw scatter plots of all pairs of numerical features. Points are colored by species.

Image produced in Jupyter

While we have four features (or “independent” variables) for each penguin, it seems that some of the features are strongly correlated, in the sense that the correlation coefficient

$$r = \frac{1}{n} \sum_{i=1}^n \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right)$$

between them is close to $\pm 1$. (To be clear, the coloring of the points is by species, which does not impact the correlations between features.)

```python
penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].corr()
```

(Table: pairwise correlation matrix of the four numerical features.)

For instance, the correlation between flipper_length_mm and body_mass_g is about 0.87, which suggests that penguins with longer flippers tend to be heavier. Not only are correlated features redundant, but they can cause interpretability issues - that is, multicollinearity - when used in a linear regression model, since the resulting optimal parameters $\vec w^*$ will no longer be interpretable. (Remember, in $h(\vec x_i) = \vec w \cdot \text{Aug}(\vec x_i)$, $\vec w^*$ contains the optimal coefficients for each feature, which measure rates of change when the other features are held constant. But if two features are correlated, one can't be held constant without also changing the other.)
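The correlation formula above is easy to verify directly. Here's a minimal sketch using NumPy, with made-up flipper length and body mass values (hypothetical numbers, for illustration only; the real dataset isn't needed to check the formula):

```python
import numpy as np

# Hypothetical flipper lengths (mm) and body masses (g), for illustration only
x = np.array([181.0, 186.0, 195.0, 193.0, 190.0])
y = np.array([3750.0, 3800.0, 3250.0, 3450.0, 3650.0])

# r = (1/n) * sum of products of standardized values (population std, ddof=0)
r = np.mean(((x - x.mean()) / x.std()) * ((y - y.mean()) / y.std()))

# This matches NumPy's built-in Pearson correlation
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```

The `1/n` in the formula cancels against the population standard deviations, which is why `ddof=0` (NumPy's default) is the consistent choice here.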

Here's the key idea: instead of having to work with $d = 4$ features, perhaps we can instead construct some number $k < d$ of new features that:

  1. capture the “most important” information in the dataset,

  2. while being uncorrelated with (in fact, orthogonal to) one another.

I’m not suggesting that we just select some of the original features and drop the others; rather, I’m proposing that we find 1 or 2 new features that are linear combinations of the original features. If everything works out correctly, we may be able to reduce the number of features we need to deal with, without losing too much information, and without the interpretability issues that come with multicollinearity.

To illustrate what I mean by constructing a new feature, let’s suppose we’re starting with a 2-dimensional dataset, which we’d like to reduce to 1 dimension. Consider the scatter plot below of flipper_length_mm vs. body_mass_g (with colors removed).

Image produced in Jupyter

Above, each penguin is represented by two numbers, which we can think of as a vector

$$\vec x_i = \begin{bmatrix} \text{body mass}_i \\ \text{flipper length}_i \end{bmatrix}$$

The fact that each point is a vector is not immediately intuitive given the plot above: I want you to imagine drawing arrows from the origin to each point.

Now, suppose these vectors $\vec x_i$ are in the rows of a matrix $X$, i.e.

$$X = \begin{bmatrix} \text{body mass}_1 & \text{flipper length}_1 \\ \text{body mass}_2 & \text{flipper length}_2 \\ \vdots & \vdots \\ \text{body mass}_n & \text{flipper length}_n \end{bmatrix}$$

Our goal is to construct a single new feature, called a principal component, by taking a linear combination of both existing features. Principal component analysis, or PCA, is the process of finding the best principal components (new features) for a dataset.

$$\text{new feature}_i = a \cdot \text{body mass}_i + b \cdot \text{flipper length}_i$$

This is equivalent to approximating this 2-dimensional dataset with a single line. (The equation above is a re-arranged version of $y = mx + b$, where $x$ is $\text{body mass}_i$ and $y$ is $\text{flipper length}_i$.)

What is the best linear combination of the features, or equivalently, what is the best line to approximate this 2-dimensional dataset? To make this idea precise, we need to define what we mean by "best". It may be tempting to find the best-fitting line that predicts flipper_length_mm from body_mass_g, but that's not quite what we want, since this is not a prediction problem: it's a representation problem. Regression lines as we've seen them before are fit by minimizing the mean squared vertical error, $y_i - h(x_i)$, but they don't capture horizontal errors at all, i.e. errors in the body_mass_g direction here.

So instead, we'd like to find the line such that, for our $n$ vectors (points) $\vec x_1, \ldots, \vec x_n$, the mean squared distance to the closest vector (point) on the line is minimized. Crucially, this distance is calculated by computing the norm of the difference between the two vectors. If $\vec p_i$ is the closest vector (point) on the line to $\vec x_i$, then we'd like to minimize

$$\frac{1}{n} \sum_{i=1}^n \lVert \vec x_i - \vec p_i \rVert^2$$

But, as we saw in Chapter 3.4, the vector (point) on a line that is closest to another vector (point) $\vec x_i$ is the orthogonal projection of $\vec x_i$ onto that line. The resulting error vectors are orthogonal to the line we're projecting onto. As such, I'll say that we're looking for the line that minimizes the mean squared orthogonal projection error.

Image produced in Jupyter

The difference between the two lines above is key. The top line reflects the supervised nature of linear regression, where the goal is to make accurate predictions. The bottom line reflects the fact that we’re looking for a new representation of the data that captures the most important information. These, in general, are not the same line, since they’re chosen to minimize different objective functions.

Recall from Chapter 1 that unsupervised learning is the process of finding structure in unlabeled data, without any specific target variable. This is precisely the problem we’re trying to solve here. To further distinguish our current task from the problem of finding regression lines, I’ll now refer to finding the line with minimal orthogonal projection error as finding the best direction.

Activity 1


Minimizing Orthogonal Error

The first section above told us that our goal is to find the best direction, which is the direction that minimizes the mean squared orthogonal error of projections onto it. How do we find this direction?

Fortunately, we spent all of Chapter 3.4 and Chapter 6.3 discussing orthogonal projections. But there, we discussed how to project one vector onto the span of one or more other given vectors (e.g. projecting the observation vector $\vec y$ onto the span of the columns of a design matrix $X$). That's not quite what we're doing here, because here, we don't know which vector we're projecting onto: we're trying to find the best vector to project our $n = 333$ other vectors onto!

Suppose that in general, $X$ is an $n \times d$ matrix, whose rows $\vec x_i$ are data points in $\mathbb{R}^d$. (In our current example, $d = 2$, but everything we're studying applies in higher dimensions.) Then,

  • The orthogonal projection of $\vec x_i$ onto $\vec v$ is

    $$\vec p_i = \frac{\vec x_i \cdot \vec v}{\vec v \cdot \vec v} \vec v$$

  • To keep the algebra slightly simpler, if we assume that $\vec v$ is a unit vector (meaning $\lVert \vec v \rVert = \sqrt{\vec v \cdot \vec v} = 1$), then the orthogonal projection of $\vec x_i$ onto $\vec v$ is

    $$\vec p_i = (\vec x_i \cdot \vec v) \vec v$$

    This is a reasonable choice to make, since unit vectors are typically used to describe directions.

  • If $\vec v \in \mathbb{R}^d$, then $\text{span}(\{ \vec v \})$ is a line through the origin in $\mathbb{R}^d$.
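The projection formulas above take one line each in NumPy. Here's a minimal sketch, with a made-up point and direction, checking the defining property of orthogonal projection - that the error vector is orthogonal to the line:

```python
import numpy as np

# A hypothetical data point and a direction (both made up for illustration)
x = np.array([3.0, 1.0])
v = np.array([1.0, 1.0])
v = v / np.linalg.norm(v)   # normalize so v is a unit vector

# Projection of x onto the line spanned by the unit vector v: p = (x . v) v
p = (x @ v) * v

# The error vector x - p is orthogonal to v
assert np.isclose((x - p) @ v, 0.0)
```

If `v` were not a unit vector, we would use the general formula `((x @ v) / (v @ v)) * v` instead.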

Centering the Data

The issue is that our dataset - like most datasets, by default - isn't centered at the origin, but the projections $\vec p_i = (\vec x_i \cdot \vec v) \vec v$ all lie on the line through the origin spanned by $\vec v$. So, trying to approximate our dataset using a line through the origin is not a good idea.

Image produced in Jupyter
Image produced in Jupyter

So, the solution is to first center the data by subtracting the mean of each column from each entry in that column. This gives us a new matrix $\tilde X$.

$$X = \begin{bmatrix} x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(d)} \\ x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(d)} \\ \vdots & \vdots & \ddots & \vdots \\ x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(d)} \end{bmatrix} \rightarrow \tilde X = \begin{bmatrix} x_1^{(1)} - \mu_1 & x_1^{(2)} - \mu_2 & \cdots & x_1^{(d)} - \mu_d \\ x_2^{(1)} - \mu_1 & x_2^{(2)} - \mu_2 & \cdots & x_2^{(d)} - \mu_d \\ \vdots & \vdots & \ddots & \vdots \\ x_n^{(1)} - \mu_1 & x_n^{(2)} - \mu_2 & \cdots & x_n^{(d)} - \mu_d \end{bmatrix}$$

In $\tilde X$, the mean of each column is 0, which means that it's reasonable to approximate the dataset using a line through the origin. Importantly, this doesn't change the shape of the dataset, meaning it doesn't change the "best direction" we're searching for.
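Centering is a one-liner thanks to broadcasting. A quick sketch with a hypothetical 2-feature matrix (made-up numbers, not the real penguins data):

```python
import numpy as np

# Hypothetical 2-feature dataset; rows are data points
X = np.array([[181.0, 3750.0],
              [186.0, 3800.0],
              [195.0, 3250.0],
              [193.0, 3450.0]])

# Subtract each column's mean from that column (broadcast over rows)
X_tilde = X - X.mean(axis=0)

# Every column of the centered matrix now has mean 0
assert np.allclose(X_tilde.mean(axis=0), 0.0)
```

Note that centering only shifts the point cloud; the differences between points, and hence the "shape" of the data, are unchanged.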

Activity 2

Finding the Best Direction

Now, $\tilde X$ is the $n \times d$ matrix of centered data, and we're ready to return to our goal of finding the best direction. Let $\tilde x_i \in \mathbb{R}^d$ refer to row $i$ of $\tilde X$. Note that I've dropped the vector symbol on top of $\tilde x_i$, not because it's not a vector, but just because drawing a tilde on top of a vector hat is a little funky: $\vec{\tilde x_i}$.

$$\tilde X = \begin{bmatrix} x_1^{(1)} - \mu_1 & x_1^{(2)} - \mu_2 & \cdots & x_1^{(d)} - \mu_d \\ x_2^{(1)} - \mu_1 & x_2^{(2)} - \mu_2 & \cdots & x_2^{(d)} - \mu_d \\ \vdots & \vdots & \ddots & \vdots \\ x_n^{(1)} - \mu_1 & x_n^{(2)} - \mu_2 & \cdots & x_n^{(d)} - \mu_d \end{bmatrix} = \begin{bmatrix} - & \tilde x_1^T & - \\ - & \tilde x_2^T & - \\ & \vdots & \\ - & \tilde x_n^T & - \end{bmatrix}$$

Our goal is to find the unit vector $\vec v \in \mathbb{R}^d$ that minimizes the mean squared orthogonal projection error. That is, if the orthogonal projection of $\tilde x_i$ onto $\vec v$ is $\vec p_i$, then we want to find the unit vector $\vec v$ that minimizes

$$J(\vec v) = \frac{1}{n}\sum_{i = 1}^n \lVert \tilde x_i - \vec p_i \rVert^2 = \frac{1}{n}\sum_{i = 1}^n \lVert \tilde x_i - (\tilde x_i \cdot \vec v) \vec v \rVert^2$$

I've chosen the letter $J$ since there is no standard notation for this quantity, and because $J$ is a commonly used letter in optimization problems. It's important to note that $J$ is not a loss function, nor is it an empirical risk, since it has nothing to do with prediction.

How do we find the unit vector $\vec v$ that minimizes $J(\vec v)$? We can start by expanding the squared norm, using the fact that $\lVert \vec a - \vec b \rVert^2 = \lVert \vec a \rVert^2 - 2 \vec a \cdot \vec b + \lVert \vec b \rVert^2$:

$$\begin{align*} J(\vec v) &= \frac{1}{n}\sum_{i = 1}^n \lVert \tilde x_i - (\tilde x_i \cdot \vec v) \vec v \rVert^2 \\ &= \frac{1}{n} \sum_{i=1}^n \left( \lVert \tilde x_i \rVert^2 - 2 \tilde x_i \cdot (\tilde x_i \cdot \vec v) \vec v + \left( (\tilde x_i \cdot \vec v) \vec v \right) \cdot \left( (\tilde x_i \cdot \vec v) \vec v \right) \right) \\ &= \frac{1}{n} \sum_{i=1}^n \left( \lVert \tilde x_i \rVert^2 - 2 \tilde x_i \cdot (\tilde x_i \cdot \vec v) \vec v + (\tilde x_i \cdot \vec v)^2 \underbrace{\vec v \cdot \vec v}_{=1}\right) \\ &= \frac{1}{n} \sum_{i=1}^n \left( \lVert \tilde x_i \rVert^2 - 2 (\tilde x_i \cdot \vec v)^2 + (\tilde x_i \cdot \vec v)^2 \right) \\ &= \frac{1}{n} \sum_{i=1}^n \left( \lVert \tilde x_i \rVert^2 - (\tilde x_i \cdot \vec v)^2 \right) \\ &= \frac{1}{n} \sum_{i = 1}^n \lVert \tilde x_i \rVert^2 - \frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2 \end{align*}$$

The first term, $\frac{1}{n} \sum_{i = 1}^n \lVert \tilde x_i \rVert^2$, is constant with respect to $\vec v$, meaning we can ignore it for the purposes of minimizing $J(\vec v)$. Our focus, then, is on minimizing the second term,

$$- \frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2$$

But, since the second term has a negative sign in front of it, minimizing it is equivalent to maximizing its positive counterpart,

$$\frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2$$

Maximizing Variance

Finding the best direction is equivalent to finding the unit vector $\vec v$ that maximizes the mean squared value of $\tilde x_i \cdot \vec v$. What is this quantity, though? Remember, the projection of $\tilde x_i$ onto $\vec v$ is given by

$$\vec p_i = (\tilde x_i \cdot \vec v) \vec v$$

All $n$ of our original data points are projected onto $\vec v$, meaning that each $\tilde x_i$ is approximated by some scalar multiple ($\tilde x_i \cdot \vec v$) of $\vec v$. These scalar multiples are our new feature values - that is, the values of the first principal component (first because this is the "best" direction).

Image produced in Jupyter

The values $\tilde x_1 \cdot \vec v, \tilde x_2 \cdot \vec v, \ldots, \tilde x_n \cdot \vec v$ - the orange values $-2.62$, $-1.49$, $-1.14$, $-1.11$, and $2.11$ above - describe the position of each point $\tilde x_i$ on the line defined by $\vec v$. (I've highlighted just five of the points to make the picture less cluttered, but in reality every single blue point has a corresponding orange dot above!)

These coefficients on v\vec v can be thought of as positions on a number line.

Image produced in Jupyter

Effectively, we’ve rotated the data so that the orange points are on the “new” xx-axis. (Rotations... where have we heard of those before? The connection is coming.)

So, we have a grasp of what the values $\tilde x_1 \cdot \vec v, \tilde x_2 \cdot \vec v, \ldots, \tilde x_n \cdot \vec v$ describe. But why are they important?

The quantity we're trying to maximize,

$$\frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2$$

is the variance of the values $\tilde x_1 \cdot \vec v, \tilde x_2 \cdot \vec v, \ldots, \tilde x_n \cdot \vec v$. It's not immediately obvious why this is the case. Remember, the variance of a list of numbers is

$$\frac{1}{n} \sum_{i = 1}^n (\text{value}_i - \text{mean})^2$$

(I’m intentionally using vague notation to avoid conflating singular values with standard deviations.)

The quantity $\frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2$ resembles this definition of variance, but it doesn't seem to involve a subtraction by a mean. This is because the mean of $\tilde x_1 \cdot \vec v, \tilde x_2 \cdot \vec v, \ldots, \tilde x_n \cdot \vec v$ is 0! (Since each column of $\tilde X$ has mean 0, the average of the $\tilde x_i$ is $\vec 0$, so the average of the $\tilde x_i \cdot \vec v$ is $\vec 0 \cdot \vec v = 0$.) This seems to check out visually above: it seems that if we found the corresponding orange point for every single blue point, the orange points would be centered around 0.
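We can verify this numerically: once the data is centered, the projections average to 0 for any direction. A quick sketch with synthetic data (the random matrix here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X_tilde = X - X.mean(axis=0)   # center the data

v = rng.normal(size=4)
v = v / np.linalg.norm(v)      # any unit direction

# The projected values (the entries of X̃ v) always average to 0
# when the data is centered
assert np.isclose((X_tilde @ v).mean(), 0.0)
```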

Since the mean of the values $\tilde x_1 \cdot \vec v, \tilde x_2 \cdot \vec v, \ldots, \tilde x_n \cdot \vec v$ is 0, the quantity we're maximizing really is the variance of those values, which I'll refer to as

$$\text{PV}(\vec v) = \frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2$$

PV stands for "projected variance"; it is the variance of the scalar projections of the rows of $\tilde X$ onto $\vec v$. This is non-standard notation, but I wanted to avoid calling this the "variance of $\vec v$", since it's not the variance of the components of $\vec v$.

The Duality

We've stumbled upon a beautiful duality - that is, two equivalent ways to think about the same problem. Given a centered $n \times d$ matrix $\tilde X$ with rows $\tilde x_i \in \mathbb{R}^d$, the most important direction $\vec v$ is the one that

  1. Minimizes the mean squared orthogonal error of projections onto it,

    $$J(\vec v) = \frac{1}{n} \sum_{i = 1}^n \lVert \tilde x_i - \vec p_i \rVert^2$$

  2. Equivalently, maximizes the variance of the projected values $\tilde x_i \cdot \vec v$, i.e. maximizes

    $$\text{PV}(\vec v) = \frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2$$

Minimizing orthogonal error is equivalent to maximizing variance: don’t forget this. In the widget below, drag the slider to see this duality in action: the shorter the orthogonal errors tend to be, the more spread out the orange points tend to be (meaning their variance is large). When the orange points are close together, the orthogonal errors tend to be quite large.

(Interactive widget; available in the online version.)

Note that the scalars $\tilde x_1 \cdot \vec v, \tilde x_2 \cdot \vec v, \ldots, \tilde x_n \cdot \vec v$ are just the components of the vector $\tilde X \vec v$! Multiplying $\tilde X$ by $\vec v$ returns a vector that contains the dot products of each row of $\tilde X$ (which are the $\tilde x_i$'s) with $\vec v$. This means that an equivalent formula for the projected variance is

$$\text{PV}(\vec v) = \frac{1}{n} \lVert \tilde X \vec v \rVert^2$$

This equivalent formulation will allow us to employ gradient rules from Chapter 8.2 to find the best direction.
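The sum form and the matrix form of the projected variance are easy to check against each other numerically. A short sketch with synthetic centered data:

```python
import numpy as np

rng = np.random.default_rng(1)
X_tilde = rng.normal(size=(50, 3))
X_tilde -= X_tilde.mean(axis=0)   # center, so rows play the role of the x̃_i

v = rng.normal(size=3)
v /= np.linalg.norm(v)            # a unit direction

n = X_tilde.shape[0]
pv_sum    = np.mean([(row @ v) ** 2 for row in X_tilde])  # (1/n) Σ (x̃_i · v)²
pv_matrix = np.linalg.norm(X_tilde @ v) ** 2 / n          # (1/n) ‖X̃ v‖²

# The two formulas agree
assert np.isclose(pv_sum, pv_matrix)
```

The matrix form is what we'll carry forward, since it lets us apply gradient rules directly.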

The Rayleigh Quotient

Our optimization problem is

$$\text{maximize } \lVert \tilde X \vec v \rVert^2 \quad \text{subject to } \lVert \vec v \rVert = 1$$

I've dropped the constant factor of $\frac{1}{n}$ for convenience; it won't change the answer.

This is a constrained optimization problem. The constraint matters because we're trying to find a direction, not an arbitrarily long vector: if we were allowed to scale $\vec v$ however we wanted, then $\lVert \tilde X \vec v \rVert^2$ could be made as large as we wanted just by multiplying $\vec v$ by a huge constant.

Constrained optimization problems are usually harder than unconstrained ones. In an unconstrained problem, a standard strategy is to compute a gradient and set it equal to $\vec 0$. With a constraint like $\lVert \vec v \rVert = 1$, that simple strategy does not apply directly, since the maximizer of $\lVert \tilde X \vec v \rVert^2$ subject to $\lVert \vec v \rVert = 1$ might not have a zero gradient.

Fortunately, this particular constrained problem can be converted into an equivalent unconstrained one. Define

$$f(\vec v) = \frac{\lVert \tilde X \vec v \rVert^2}{\lVert \vec v \rVert^2}$$

for all non-zero vectors v\vec v. This function is called a Rayleigh quotient, as we discussed briefly in Chapter 9.6.

Why is maximizing $f(\vec v)$ with no constraints equivalent to maximizing $\lVert \tilde X \vec v \rVert^2$ subject to $\lVert \vec v \rVert = 1$? The short answer is that $f(\vec v)$ is invariant to scaling $\vec v$ by a constant factor. To see what I mean, suppose that $c$ is a non-zero constant. Then,

$$f(c\vec v) = \frac{\lVert \tilde X (c\vec v) \rVert^2}{\lVert c\vec v \rVert^2} = \frac{c^2 \lVert \tilde X \vec v \rVert^2}{c^2 \lVert \vec v \rVert^2} = f(\vec v)$$

So $f(\vec v)$ depends only on direction, not on length. Equivalently, every non-zero $\vec v$ gives the same value as the corresponding unit vector $\frac{\vec v}{\lVert \vec v \rVert}$, which makes the denominator in $f(\vec v)$ equal to 1. That means maximizing $\lVert \tilde X \vec v \rVert^2$ subject to $\lVert \vec v \rVert = 1$ is equivalent to maximizing $f(\vec v)$ over all non-zero vectors $\vec v$.
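The scale invariance of the Rayleigh quotient is simple to confirm numerically. A sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
X_tilde = rng.normal(size=(30, 3))
X_tilde -= X_tilde.mean(axis=0)

def f(v):
    # Rayleigh quotient: ‖X̃ v‖² / ‖v‖²
    return np.linalg.norm(X_tilde @ v) ** 2 / np.linalg.norm(v) ** 2

v = rng.normal(size=3)

# Scaling v by any non-zero constant leaves f unchanged
assert np.isclose(f(v), f(7.5 * v))
assert np.isclose(f(v), f(-0.3 * v))
```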

This is easier, because once the constraint has been absorbed into the denominator, we can treat ff as an ordinary vector-to-scalar function and find its critical points by setting the gradient equal to 0\vec 0.

In our setting,

$$f(\vec v) = \frac{\lVert \tilde X \vec v \rVert^2}{\lVert \vec v \rVert^2} = \frac{\vec v^T \tilde X^T \tilde X \vec v}{\vec v^T \vec v}$$

To find the critical points of ff, it is cleaner to rewrite it as a product:

$$f(\vec v) = \left( \lVert \tilde X \vec v \rVert^2 \right) \left( \lVert \vec v \rVert^{-2} \right)$$

Let

$$\begin{align*} g(\vec v) &= \lVert \tilde X \vec v \rVert^2 = \vec v^T \tilde X^T \tilde X \vec v \\ h(\vec v) &= \lVert \vec v \rVert^{-2} = (\vec v^T \vec v)^{-1} \end{align*}$$

so that $f(\vec v) = g(\vec v) h(\vec v)$. Since $\tilde X^T \tilde X$ is symmetric,

$$\nabla g(\vec v) = 2 \tilde X^T \tilde X \vec v$$

Also, by the chain rule,

$$\nabla h(\vec v) = -2 \frac{\vec v}{\lVert \vec v \rVert^4}$$

Now the product rule gives

$$\nabla f(\vec v) = g(\vec v) \nabla h(\vec v) + h(\vec v) \nabla g(\vec v)$$

which simplifies to

$$\begin{align*} \nabla f(\vec v) &= g(\vec v) \nabla h(\vec v) + h(\vec v) \nabla g(\vec v) \\ &= \vec v^T \tilde X^T \tilde X \vec v \left(-2 \frac{\vec v}{\lVert \vec v \rVert^4}\right) + (\vec v^T \vec v)^{-1} \left(2 \tilde X^T \tilde X \vec v\right) \\ &= -2 \frac{\vec v^T \tilde X^T \tilde X \vec v}{\lVert \vec v \rVert^4} \vec v + 2 \frac{\tilde X^T \tilde X \vec v}{\lVert \vec v \rVert^2} \\ &= -2 \frac{f(\vec v)}{\lVert \vec v \rVert^2} \vec v + 2 \frac{\tilde X^T \tilde X \vec v}{\lVert \vec v \rVert^2} \\ &= \frac{2}{\lVert \vec v \rVert^2} \left( \tilde X^T \tilde X \vec v - f(\vec v) \vec v \right) \end{align*}$$
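Before trusting this gradient, we can sanity-check it against a finite-difference approximation on synthetic data (a standard way to catch algebra slips in derivations like this):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))
X -= X.mean(axis=0)   # centered data matrix X̃

def f(v):
    return np.linalg.norm(X @ v) ** 2 / np.linalg.norm(v) ** 2

def grad_f(v):
    # ∇f(v) = (2 / ‖v‖²) (X̃ᵀ X̃ v − f(v) v), as derived above
    return 2.0 / (v @ v) * (X.T @ X @ v - f(v) * v)

v = rng.normal(size=3)

# Central finite differences along each coordinate axis
eps = 1e-6
numeric = np.array([(f(v + eps * e) - f(v - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

assert np.allclose(grad_f(v), numeric, atol=1e-4)
```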

If $\vec v$ maximizes $f(\vec v)$, then it must be a critical point, so $\nabla f(\vec v) = \vec 0$. Therefore,

$$\tilde X^T \tilde X \vec v - f(\vec v) \vec v = \vec 0$$

which means

$$\tilde X^T \tilde X \vec v = f(\vec v) \vec v$$

This says exactly that $\vec v$ is an eigenvector of $\tilde X^T \tilde X$, and that its eigenvalue is $f(\vec v)$. If $\vec v$ maximizes - or minimizes - $f(\vec v)$, then it must be an eigenvector of $\tilde X^T \tilde X$.

So, to make $f(\vec v)$ as large as possible, we should choose the eigenvector whose eigenvalue is as large as possible. In other words, the best direction is the eigenvector of $\tilde X^T \tilde X$ corresponding to its largest eigenvalue!

What does this really mean geometrically?

Image produced in Jupyter

It says that the line that maximizes the variance of the projected data - which is also the line that minimizes the mean squared orthogonal error - is spanned by the eigenvector of $\tilde X^T \tilde X$ corresponding to the largest eigenvalue of $\tilde X^T \tilde X$.
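The full conclusion can be checked end to end on synthetic data: compute the top eigenvector of $\tilde X^T \tilde X$, then confirm that no other direction achieves a larger projected variance. (Since $\tilde X^T \tilde X$ is symmetric, `np.linalg.eigh` applies, and it returns eigenvalues in ascending order.)

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic correlated data: mix independent columns so features correlate
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X_tilde = X - X.mean(axis=0)

# Eigendecomposition of the symmetric matrix X̃ᵀX̃ (eigenvalues ascending)
eigvals, eigvecs = np.linalg.eigh(X_tilde.T @ X_tilde)
v_best = eigvecs[:, -1]   # eigenvector for the largest eigenvalue

def pv(v):
    # Projected variance (1/n) ‖X̃ v‖² for a unit vector v
    return np.linalg.norm(X_tilde @ v) ** 2 / len(X_tilde)

# No random unit direction should beat the top eigenvector
for _ in range(100):
    u = rng.normal(size=4)
    u /= np.linalg.norm(u)
    assert pv(u) <= pv(v_best) + 1e-9
```

The projected values `X_tilde @ v_best` are then the first principal component of this synthetic dataset.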

So, we’ve solved the “best direction” problem. But what can we do with it? And how does this relate to the singular value decomposition, which we discussed in Chapters 10.1 and 10.2 but didn’t touch at all here? Let’s keep reading...