
10.3. The Best Direction

Introduction

As we saw in Chapter 10.2, the singular value decomposition of a matrix $X$ into $X = U \Sigma V^T$ can be used to find the best low-rank approximation of $X$, which has applications in image compression, for example. But, hiding in the singular value decomposition is another, more data-driven application: dimensionality reduction.

Consider, for example, the dataset below. Each row corresponds to a penguin.

[Table: preview of the penguins dataset]

Each penguin has three categorical features - species, island, and sex - and four numerical features - bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. If we restrict our attention to these numerical features, we have a $333 \times 4$ matrix, which we can think of as 333 points in $\mathbb{R}^4$. We, of course, can't visualize points in $\mathbb{R}^4$. One solution is to draw scatter plots of all pairs of numerical features. Points are colored by species.

[Figure: scatter plots of all pairs of numerical features, colored by species]

While we have four features (or “independent” variables) for each penguin, it seems that some of the features are strongly correlated, in the sense that the correlation coefficient

$$r = \frac{1}{n} \sum_{i=1}^n \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right)$$

between them is close to $\pm 1$. (To be clear, the coloring of the points is by species, which does not impact the correlations between features.)

```python
penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].corr()
```

[Output: $4 \times 4$ correlation matrix of the numerical features]

For instance, the correlation between flipper_length_mm and body_mass_g is about 0.87, which suggests that penguins with longer flippers tend to be heavier. Not only are correlated features redundant, but they can cause interpretability issues - that is, multicollinearity - when using them in a linear regression model, since the resulting optimal parameters $\vec w^*$ will no longer be interpretable. (Remember, in $h(\vec x_i) = \vec w \cdot \text{Aug}(\vec x_i)$, $\vec w^*$ contains the optimal coefficients for each feature, which measure rates of change when the other features are held constant. But if two features are correlated, one can't be held constant without also changing the other.)
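To make the formula for $r$ concrete, here's a small check that computing it directly matches pandas' built-in `.corr()`. This is only a sketch: it assumes the `penguins` DataFrame from above is available and that these columns have no missing values.

```python
import numpy as np

# A quick check of the formula for r, using the two features discussed above.
x = penguins['flipper_length_mm'].to_numpy(dtype=float)
y = penguins['body_mass_g'].to_numpy(dtype=float)

n = len(x)
z_x = (x - x.mean()) / x.std()   # np.std divides by n, matching the formula above
z_y = (y - y.mean()) / y.std()
r = (z_x * z_y).sum() / n

print(r)                                                             # roughly 0.87
print(penguins['flipper_length_mm'].corr(penguins['body_mass_g']))   # same value
```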

Here's the key idea: instead of having to work with $d = 4$ features, perhaps we can instead construct some number $k < d$ of new features that:

  1. capture the “most important” information in the dataset,

  2. while being uncorrelated (i.e. linearly independent) with one another.

I’m not suggesting that we just select some of the original features and drop the others; rather, I’m proposing that we find 1 or 2 new features that are linear combinations of the original features. If everything works out correctly, we may be able to reduce the number of features we need to deal with, without losing too much information, and without the interpretability issues that come with multicollinearity.

To illustrate what I mean by constructing a new feature, let’s suppose we’re starting with a 2-dimensional dataset, which we’d like to reduce to 1 dimension. Consider the scatter plot below of flipper_length_mm vs. body_mass_g (with colors removed).

[Figure: scatter plot of flipper_length_mm vs. body_mass_g]

Above, each penguin is represented by two numbers, which we can think of as a vector

$$\vec x_i = \begin{bmatrix} \text{body mass}_i \\ \text{flipper length}_i \end{bmatrix}$$

The fact that each point is a vector is not immediately intuitive given the plot above: I want you to imagine drawing arrows from the origin to each point.

Now, suppose these vectors $\vec x_i$ are in the rows of a matrix $X$, i.e.

$$X = \begin{bmatrix} \text{body mass}_1 & \text{flipper length}_1 \\ \text{body mass}_2 & \text{flipper length}_2 \\ \vdots & \vdots \\ \text{body mass}_n & \text{flipper length}_n \end{bmatrix}$$

Our goal is to construct a single new feature, called a principal component, by taking a linear combination of both existing features. Principal component analysis, or PCA, is the process of finding the best principal components (new features) for a dataset.

$$\text{new feature}_i = a \cdot \text{body mass}_i + b \cdot \text{flipper length}_i$$

This is equivalent to approximating this 2-dimensional dataset with a single line. (The equation above is a re-arranged version of $y = mx + b$, where $x$ is $\text{body mass}_i$ and $y$ is $\text{flipper length}_i$.)

What is the best linear combination of the features, or equivalently, what is the best line to approximate this 2-dimensional dataset? To make this idea precise, we need to define what we mean by "best". It may be tempting to think that we should find the best-fitting line that predicts flipper_length_mm from body_mass_g, but that's not quite what we want, since this is not a prediction problem: it's a representation problem. Regression lines in the way we've seen them before are fit by minimizing mean squared vertical error, $y_i - h(x_i)$, but don't at all capture horizontal errors, i.e. errors in the body_mass_g direction here.

So instead, we'd like to find the line such that, for our $n$ vectors (points) $\vec x_1, \ldots, \vec x_n$, the mean squared distance to the closest vector (point) on the line is minimized. Crucially, this distance is calculated by computing the norm of the difference between the two vectors. If $\vec p_i$ is the closest vector (point) on the line to $\vec x_i$, then we'd like to minimize

$$\frac{1}{n} \sum_{i=1}^n \lVert \vec x_i - \vec p_i \rVert^2$$

But, as we saw in Chapter 3.4, the vector (point) on a line that is closest to another vector (point) $\vec x_i$ is the orthogonal projection of $\vec x_i$ onto that line. The resulting error vectors are orthogonal to the line we're projecting onto. As such, I'll say that we're looking for the line that minimizes the mean squared orthogonal projection error.

[Figure: the same scatter plot shown twice - on top, the regression line with its vertical errors; on the bottom, the first principal component direction with its orthogonal projection errors]

The difference between the two lines above is key. The top line reflects the supervised nature of linear regression, where the goal is to make accurate predictions. The bottom line reflects the fact that we’re looking for a new representation of the data that captures the most important information. These, in general, are not the same line, since they’re chosen to minimize different objective functions.

Recall from Chapter 1 that unsupervised learning is the process of finding structure in unlabeled data, without any specific target variable. This is precisely the problem we’re trying to solve here. To further distinguish our current task from the problem of finding regression lines, I’ll now refer to finding the line with minimal orthogonal projection error as finding the best direction.


Minimizing Orthogonal Error

The first section above told us that our goal is to find the best direction, which is the direction that minimizes the mean squared orthogonal error of projections onto it. How do we find this direction?

Fortunately, we spent all of Chapter 3.4 and Chapter 6.3 discussing orthogonal projections. But there, we discussed how to project one vector onto the span of one or more other given vectors (e.g. projecting the observation vector $\vec y$ onto the span of the columns of a design matrix $X$). That's not quite what we're doing here, because here, we don't know which vector we're projecting onto: we're trying to find the best vector to project our $n = 333$ other vectors onto!

Suppose that in general, $X$ is an $n \times d$ matrix, whose rows $\vec x_i$ are data points in $\mathbb{R}^d$. (In our current example, $d = 2$, but everything we're studying applies in higher dimensions.) Then,

  • The orthogonal projection of $\vec x_i$ onto $\vec v$ is

    $$\vec p_i = \frac{\vec x_i \cdot \vec v}{\vec v \cdot \vec v} \vec v$$

  • To keep the algebra slightly simpler, if we assume that $\vec v$ is a unit vector (meaning $\lVert \vec v \rVert = \sqrt{\vec v \cdot \vec v} = 1$), then the orthogonal projection of $\vec x_i$ onto $\vec v$ is

    $$\vec p_i = (\vec x_i \cdot \vec v) \vec v$$

    This is a reasonable choice to make, since unit vectors are typically used to describe directions.

  • If $\vec v \in \mathbb{R}^d$, then $\text{span}(\{\vec v\})$ is a line through the origin in $\mathbb{R}^d$.
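As a small numerical illustration of the projection formula above (a sketch with made-up numbers, not the penguin data), here it is in code, along with a check that the error vector really is orthogonal to $\vec v$:

```python
import numpy as np

x_i = np.array([3.0, 1.0])           # an arbitrary data point in R^2
v = np.array([2.0, 1.0])
v = v / np.linalg.norm(v)            # normalize v so it's a unit vector

p_i = (x_i @ v) * v                  # orthogonal projection of x_i onto span({v})
error = x_i - p_i                    # projection error vector

print(p_i)
print(error @ v)                     # ~0: the error is orthogonal to v
```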

Centering the Data

The issue is that our dataset - like most datasets, by default - isn't centered at the origin, but the projections $\vec p_i = (\vec x_i \cdot \vec v) \vec v$ all lie on the line through the origin spanned by $\vec v$. So, trying to approximate our dataset using a line through the origin is not a good idea.

[Figures: the penguin data, which is not centered at the origin, along with lines through the origin that approximate it poorly]

So, the solution is to first center the data by subtracting the mean of each column from each entry in that column. This gives us a new matrix $\tilde X$.

$$X = \begin{bmatrix} x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(d)} \\ x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(d)} \\ \vdots & \vdots & \ddots & \vdots \\ x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(d)} \end{bmatrix} \rightarrow \tilde X = \begin{bmatrix} x_1^{(1)} - \mu_1 & x_1^{(2)} - \mu_2 & \cdots & x_1^{(d)} - \mu_d \\ x_2^{(1)} - \mu_1 & x_2^{(2)} - \mu_2 & \cdots & x_2^{(d)} - \mu_d \\ \vdots & \vdots & \ddots & \vdots \\ x_n^{(1)} - \mu_1 & x_n^{(2)} - \mu_2 & \cdots & x_n^{(d)} - \mu_d \end{bmatrix}$$

In $\tilde X$, the mean of each column is 0, which means that it's reasonable to approximate the dataset using a line through the origin. Importantly, this doesn't change the shape of the dataset, meaning it doesn't change the "best direction" we're searching for.
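In code, centering is just a column-wise mean subtraction. Here is a sketch with a small made-up array standing in for the matrix of numerical features:

```python
import numpy as np

# X stands in for an n x d array of numerical features (made-up values here).
X = np.array([[181.0, 3750.0],
              [186.0, 3800.0],
              [195.0, 3250.0],
              [193.0, 3450.0]])

mu = X.mean(axis=0)        # the vector of column means (mu_1, ..., mu_d)
X_tilde = X - mu           # broadcasting subtracts mu_j from every entry in column j

print(np.allclose(X_tilde.mean(axis=0), 0))   # True: every column now has mean 0
```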

Finding the Best Direction

Now, $\tilde X$ is the $n \times d$ matrix of centered data, and we're ready to return to our goal of finding the best direction. Let $\tilde x_i \in \mathbb{R}^d$ refer to row $i$ of $\tilde X$. Note that I've dropped the vector symbol on top of $\tilde x_i$, not because it's not a vector, but just because drawing an arrow on top of a tilde is a little funky: $\vec{\tilde x_i}$.

$$\tilde X = \begin{bmatrix} x_1^{(1)} - \mu_1 & x_1^{(2)} - \mu_2 & \cdots & x_1^{(d)} - \mu_d \\ x_2^{(1)} - \mu_1 & x_2^{(2)} - \mu_2 & \cdots & x_2^{(d)} - \mu_d \\ \vdots & \vdots & \ddots & \vdots \\ x_n^{(1)} - \mu_1 & x_n^{(2)} - \mu_2 & \cdots & x_n^{(d)} - \mu_d \end{bmatrix} = \begin{bmatrix} - & \tilde x_1^T & - \\ - & \tilde x_2^T & - \\ & \vdots & \\ - & \tilde x_n^T & - \end{bmatrix}$$

Our goal is to find the unit vector $\vec v \in \mathbb{R}^d$ that minimizes the mean squared orthogonal projection error. That is, if the orthogonal projection of $\tilde x_i$ onto $\vec v$ is $\vec p_i$, then we want to find the unit vector $\vec v$ that minimizes

$$J(\vec v) = \frac{1}{n}\sum_{i = 1}^n \lVert \tilde x_i - \vec p_i \rVert^2 = \frac{1}{n}\sum_{i = 1}^n \lVert \tilde x_i - (\tilde x_i \cdot \vec v) \vec v \rVert^2$$

I've chosen the letter $J$ since there is no standard notation for this quantity, and because $J$ is a commonly used letter in optimization problems. It's important to note that $J$ is not a loss function, nor is it an empirical risk, since it has nothing to do with prediction.

How do we find the unit vector $\vec v$ that minimizes $J(\vec v)$? We can start by expanding the squared norm, using the fact that $\lVert \vec a - \vec b \rVert^2 = \lVert \vec a \rVert^2 - 2 \vec a \cdot \vec b + \lVert \vec b \rVert^2$:

$$
\begin{align*}
J(\vec v) &= \frac{1}{n}\sum_{i = 1}^n \lVert \tilde x_i - (\tilde x_i \cdot \vec v) \vec v \rVert^2 \\
&= \frac{1}{n} \sum_{i=1}^n \left( \lVert \tilde x_i \rVert^2 - 2 \tilde x_i \cdot (\tilde x_i \cdot \vec v) \vec v + \left( (\tilde x_i \cdot \vec v) \vec v \right) \cdot \left( (\tilde x_i \cdot \vec v) \vec v \right) \right) \\
&= \frac{1}{n} \sum_{i=1}^n \left( \lVert \tilde x_i \rVert^2 - 2 \tilde x_i \cdot (\tilde x_i \cdot \vec v) \vec v + (\tilde x_i \cdot \vec v)^2 \underbrace{\vec v \cdot \vec v}_{=1} \right) \\
&= \frac{1}{n} \sum_{i=1}^n \left( \lVert \tilde x_i \rVert^2 - 2 (\tilde x_i \cdot \vec v)^2 + (\tilde x_i \cdot \vec v)^2 \right) \\
&= \frac{1}{n} \sum_{i=1}^n \left( \lVert \tilde x_i \rVert^2 - (\tilde x_i \cdot \vec v)^2 \right) \\
&= \frac{1}{n} \sum_{i = 1}^n \lVert \tilde x_i \rVert^2 - \frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2
\end{align*}
$$

The first term, $\frac{1}{n} \sum_{i = 1}^n \lVert \tilde x_i \rVert^2$, is constant with respect to $\vec v$, meaning we can ignore it for the purposes of minimizing $J(\vec v)$. Our focus, then, is on minimizing the second term,

$$- \frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2$$

But, since the second term has a negative sign in front of it, minimizing it is equivalent to maximizing its positive counterpart,

$$\frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2$$
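Before moving on, we can sanity-check the algebra above numerically: for centered data and any unit vector $\vec v$, $J(\vec v)$ should equal the constant first term minus the quantity we're about to maximize. A sketch with randomly generated data:

```python
import numpy as np

rng = np.random.default_rng(0)
X_tilde = rng.normal(size=(100, 2))
X_tilde -= X_tilde.mean(axis=0)               # center the random data

v = rng.normal(size=2)
v /= np.linalg.norm(v)                        # a random unit vector

P = (X_tilde @ v)[:, None] * v                # row i is the projection p_i
J = np.mean(np.sum((X_tilde - P) ** 2, axis=1))
constant_term = np.mean(np.sum(X_tilde ** 2, axis=1))
projected_term = np.mean((X_tilde @ v) ** 2)

print(np.isclose(J, constant_term - projected_term))   # True
```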

Maximizing Variance

Finding the best direction is equivalent to finding the unit vector $\vec v$ that maximizes the mean squared value of $\tilde x_i \cdot \vec v$. What is this quantity, though? Remember, the projection of $\tilde x_i$ onto $\vec v$ is given by

$$\vec p_i = (\tilde x_i \cdot \vec v) \vec v$$

All $n$ of our original data points are projected onto $\vec v$, meaning that each $\tilde x_i$ is approximated by some scalar multiple ($\tilde x_i \cdot \vec v$) of $\vec v$. These scalar multiples are our new feature values - that is, the values of the first principal component (first because this is the "best" direction).

[Figure: the centered data projected onto the line spanned by $\vec v$, with five highlighted points and their projections shown in orange]

The values $\tilde x_1 \cdot \vec v, \tilde x_2 \cdot \vec v, \ldots, \tilde x_n \cdot \vec v$ - ${\color{orange}-2.62}$, ${\color{orange}-1.49}$, ${\color{orange}-1.14}$, ${\color{orange}1.11}$, and ${\color{orange}2.11}$ above - describe the positions of the points $\tilde x_i$ on the line defined by $\vec v$. (I've highlighted just five of the points to make the picture less cluttered, but in reality every single blue point has a corresponding orange dot above!)

These coefficients on $\vec v$ can be thought of as positions on a number line.

[Figure: the projected values $\tilde x_i \cdot \vec v$ plotted on a number line]

Effectively, we've rotated the data so that the orange points are on the "new" $x$-axis. (Rotations... where have we heard of those before? The connection is coming.)

So, we have a grasp of what the values ${\color{orange}\tilde x_1 \cdot \vec v, \tilde x_2 \cdot \vec v, \ldots, \tilde x_n \cdot \vec v}$ describe. But why are they important?

The quantity we’re trying to maximize,

$$\frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2$$

is the variance of the values ${\color{orange}\tilde x_1 \cdot \vec v, \tilde x_2 \cdot \vec v, \ldots, \tilde x_n \cdot \vec v}$. It's not immediately obvious why this is the case. Remember, the variance of a list of numbers is

$$\frac{1}{n} \sum_{i = 1}^n (\text{value}_i - \text{mean})^2$$

(I’m intentionally using vague notation to avoid conflating singular values with standard deviations.)

The quantity $\frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2$ resembles this definition of variance, but it doesn't seem to involve a subtraction by a mean. This is because the mean of $\tilde x_1 \cdot \vec v, \tilde x_2 \cdot \vec v, \ldots, \tilde x_n \cdot \vec v$ is 0! This is a consequence of centering: the average of the rows $\tilde x_i$ is the zero vector, so the average of $\tilde x_i \cdot \vec v$ is $\vec 0 \cdot \vec v = 0$. This also seems to check out visually above: if we found the corresponding orange point for every single blue point, the orange points would be symmetrically distributed around 0.

Since the mean of the values $\tilde x_1 \cdot \vec v, \tilde x_2 \cdot \vec v, \ldots, \tilde x_n \cdot \vec v$ is 0, it is true that the quantity we're maximizing is the variance of the values $\tilde x_1 \cdot \vec v, \tilde x_2 \cdot \vec v, \ldots, \tilde x_n \cdot \vec v$, which I'll refer to as

$$\text{PV}(\vec v) = \frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2$$

PV stands for "projected variance"; it is the variance of the scalar projections of the rows of $\tilde X$ onto $\vec v$. This is non-standard notation, but I wanted to avoid calling this the "variance of $\vec v$", since it's not the variance of the components of $\vec v$.
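Here's a quick numerical check that $\text{PV}(\vec v)$ really is the variance of the projected values, and that their mean is 0. This is a sketch using random centered data rather than the penguins:

```python
import numpy as np

rng = np.random.default_rng(1)
X_tilde = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0],
                                                [1.5, 0.5]])   # correlated columns
X_tilde -= X_tilde.mean(axis=0)                # centered data

v = np.array([1.0, 1.0]) / np.sqrt(2)          # some unit vector

scores = X_tilde @ v                           # the projected values x~_i . v
print(np.isclose(scores.mean(), 0))            # True: a consequence of centering
print(np.isclose(np.mean(scores ** 2), np.var(scores)))   # True: PV(v) is their variance
```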

The Duality

We've stumbled upon a beautiful duality - that is, two equivalent ways to think about the same problem. Given a centered $n \times d$ matrix $\tilde X$ with rows $\tilde x_i \in \mathbb{R}^d$, the most important direction $\vec v$ is the one that

  1. Minimizes the mean squared orthogonal error of projections onto it,

     $$J(\vec v) = \frac{1}{n} \sum_{i = 1}^n \lVert \tilde x_i - \vec p_i \rVert^2$$

  2. Equivalently, maximizes the variance of the projected values $\tilde x_i \cdot \vec v$, i.e. maximizes

     $$\text{PV}(\vec v) = \frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2$$

Minimizing orthogonal error is equivalent to maximizing variance: don’t forget this. In the widget below, drag the slider to see this duality in action: the shorter the orthogonal errors tend to be, the more spread out the orange points tend to be (meaning their variance is large). When the orange points are close together, the orthogonal errors tend to be quite large.

[Interactive widget: drag the slider to change the candidate direction]

Here’s a related StackExchange post with a similar animation.

Minimizing mean squared orthogonal error is a useful goal: this brings our 1-dimensional approximation of the data as close as possible to the original data. Why is maximizing variance another useful goal?

Here’s an analogy: suppose I had a dataset with several characteristics about each student in this course, and I needed to come up with a single number to describe each student. Perhaps I’d take some weighted combination of the number of lectures they’ve attended, their Midterm 1 and Midterm 2 scores, and the number of office hours they’ve attended. I wouldn’t factor in the number of semesters they’ve been at Michigan, since this number is likely to be similar across all students, while these other characteristics vary more significantly.

The guiding philosophy here is that variance is information. I want to come up with a linear combination of features that remains as different as possible across different data points, as this will allow me to distinguish between different data points, i.e. most faithfully represent the data, while being in a lower-dimensional space.

Optimization Constraints

Okay, so we know that minimizing mean squared orthogonal error is equivalent to maximizing variance. Presumably, this interpretation will help us in our quest to find the most important direction, $\vec v$.

Again, we want to maximize

$$\text{PV}(\vec v) = \frac{1}{n} \sum_{i = 1}^n (\tilde x_i \cdot \vec v)^2$$

$\text{PV}(\vec v)$ is a vector-to-scalar function; it takes in vectors in $\mathbb{R}^d$ and returns a scalar. So, to maximize it, we'd need to take its gradient with respect to $\vec v$. (The same would have been true if we wanted to minimize the mean squared orthogonal error - we'd need to have taken the gradient of $J(\vec v)$. The point of highlighting the duality above, in addition to helping build intuition, was to address the fact that $J(\vec v)$ doesn't have a simple gradient, but $\text{PV}(\vec v)$ does.)

Can we write $\text{PV}(\vec v)$ in a way that allows us to apply the gradient rules we developed in Chapter 8? Using the observation that the values $\tilde x_1 \cdot \vec v, \tilde x_2 \cdot \vec v, \ldots, \tilde x_n \cdot \vec v$ are all components of the matrix-vector product $\tilde X \vec v$, we can rewrite $\text{PV}(\vec v)$ as

$$\text{PV}(\vec v) = \frac{1}{n} \lVert \tilde X \vec v \rVert^2$$

which can further be simplified to

$$\text{PV}(\vec v) = \frac{1}{n} \vec v^T \tilde X^T \tilde X \vec v$$
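All three ways of writing $\text{PV}(\vec v)$ - the sum, the squared norm, and the quadratic form - agree, which we can confirm numerically. A sketch with random centered data:

```python
import numpy as np

rng = np.random.default_rng(2)
X_tilde = rng.normal(size=(150, 2))
X_tilde -= X_tilde.mean(axis=0)                # centered data
n = X_tilde.shape[0]

v = np.array([0.6, 0.8])                       # already a unit vector

pv_sum  = np.mean((X_tilde @ v) ** 2)          # (1/n) sum_i (x~_i . v)^2
pv_norm = np.linalg.norm(X_tilde @ v) ** 2 / n # (1/n) ||X~ v||^2
pv_quad = (v @ X_tilde.T @ X_tilde @ v) / n    # (1/n) v^T X~^T X~ v

print(np.allclose([pv_sum, pv_norm, pv_quad], pv_sum))   # True: all three match
```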

Now, we can apply the gradient rules in Chapter 8 - specifically, the quadratic form rule - to find $\nabla \text{PV}(\vec v)$. But, there's something we've forgotten. Let's take a step back.

Remember, $\vec v$ must be a unit vector! All of the optimization problems we've seen so far in this course have been unconstrained, meaning that there were never restrictions on what the optimal solutions could be. This is our first time encountering a constrained optimization problem. We can write this optimization problem as

$$
\begin{aligned}
\text{max} \quad & \frac{1}{n} \lVert \tilde X \vec v \rVert^2 \\
\text{subject to} \quad & \lVert \vec v \rVert = 1
\end{aligned}
$$

If we don't constrain $\vec v$ to be a unit vector, then $\text{PV}(\vec v)$ is unbounded: the larger I make the components in $\vec v$, the larger $\text{PV}(\vec v)$ becomes.

One general solution for solving constrained optimization problems is the method of Lagrange multipliers, which is typically covered in a Calculus 3 course. We won't cover this approach here. Instead, we'll use another approach that works for this specific problem. Instead of maximizing

$$\text{PV}(\vec v) = \frac{1}{n} \lVert \tilde X \vec v \rVert^2$$

we'll maximize a related function that automatically constrains $\vec v$ to be a unit vector, without us having to write a constraint.

$$f(\vec v) = \frac{\lVert \tilde X \vec v \rVert^2}{\lVert \vec v \rVert^2}$$

Why is maximizing $f(\vec v)$ equivalent to maximizing $\text{PV}(\vec v)$ with the unit vector constraint?

  • First, note that the factor of $\frac{1}{n}$ from $\text{PV}(\vec v)$ has been dropped: maximizing $\frac{1}{n} \lVert \tilde X \vec v \rVert^2$ is equivalent to maximizing $\lVert \tilde X \vec v \rVert^2$. The same vector wins in both cases. So, for convenience, we'll drop the $\frac{1}{n}$ for now.

  • Now, suppose $\vec v$ is a unit vector. Then, $\lVert \vec v \rVert^2 = 1$, so $f(\vec v) = \lVert \tilde X \vec v \rVert^2$.

  • If $\vec v$ is not a unit vector, then $\vec v = c \vec u$, where $c \neq 0$ is a scalar and $\vec u$ is a unit vector. But then

    $$f(\vec v) = \frac{\lVert \tilde X \vec v \rVert^2}{\lVert \vec v \rVert^2} = \frac{\lVert \tilde X (c \vec u) \rVert^2}{\lVert c \vec u \rVert^2} = \frac{\lVert c \tilde X \vec u \rVert^2}{c^2 \lVert \vec u \rVert^2} = \frac{c^2 \lVert \tilde X \vec u \rVert^2}{c^2 \lVert \vec u \rVert^2} = \frac{\lVert \tilde X \vec u \rVert^2}{\lVert \vec u \rVert^2} = f(\vec u)$$

    which means that $f(\vec v) = f(\vec u)$, i.e. $f(\vec v)$ is equal to $f$ applied to a unit vector with the same direction as $\vec v$.

Dividing by $\lVert \vec v \rVert^2$ effectively neutralizes the effect of the length of $\vec v$ and instead focuses on the direction of $\vec v$, which is precisely what we're looking for.
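This scale-invariance is easy to see in code: scaling $\vec v$ doesn't change $f(\vec v)$. A sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(3)
X_tilde = rng.normal(size=(50, 3))
X_tilde -= X_tilde.mean(axis=0)

def f(v):
    """The ratio ||X~ v||^2 / ||v||^2, which only depends on the direction of v."""
    return np.linalg.norm(X_tilde @ v) ** 2 / np.linalg.norm(v) ** 2

u = rng.normal(size=3)
u /= np.linalg.norm(u)                   # a unit vector

print(np.isclose(f(u), f(7.5 * u)))      # True: f(c u) = f(u) for any c != 0
```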

The Rayleigh Quotient

Now, our problem is to find the vector $\vec v$ that maximizes

$$f(\vec v) = \frac{\lVert \tilde X \vec v \rVert^2}{\lVert \vec v \rVert^2}$$

Here’s the big idea: we’ve seen this function before! Specifically, in Homework 9, Problem 4, you studied the Rayleigh quotient, which is the function

$$g(\vec v) = \frac{\vec v^T A \vec v}{\vec v^T \vec v}$$

for a symmetric $n \times n$ matrix $A$. There, you showed that the gradient of $g(\vec v)$ is given by

$$\nabla g(\vec v) = \frac{2}{\vec v^T \vec v} \left( A \vec v - g(\vec v) \vec v \right)$$

Which vector $\vec v$ maximizes $g(\vec v)$? If $\vec v$ maximizes $g(\vec v)$, then $\nabla g(\vec v) = \vec 0$.

$$\frac{2}{\vec v^T \vec v} \left( A \vec v - g(\vec v) \vec v \right) = \vec 0 \implies A \vec v - g(\vec v) \vec v = \vec 0 \implies A \vec v = g(\vec v) \vec v$$

What this is saying is that if $\nabla g(\vec v) = \vec 0$, then $\vec v$ is an eigenvector of $A$!

As an example - and this is a total diversion from the main storyline of the lecture, but it's an important one - let's visualize the Rayleigh quotient for the symmetric $2 \times 2$ matrix

$$A = \begin{bmatrix} 1 & 5 \\ 5 & 1 \end{bmatrix}$$

If we let $\vec v = \begin{bmatrix} x \\ y \end{bmatrix}$, then this Rayleigh quotient is

$$g(\vec v) = \frac{\vec v^T A \vec v}{\vec v^T \vec v} = \frac{x^2 + 10xy + y^2}{x^2 + y^2}$$

I have visualized it as a contour plot, a concept we first encountered in Chapter 8.1.

[Contour plot of the Rayleigh quotient $g(\vec v)$]

Dark red regions correspond to the highest values of the Rayleigh quotient, and dark blue regions correspond to the lowest values. You'll notice that the darkest red and darkest blue regions form lines: these lines are the directions of the eigenvectors of $A$! The corresponding outputs along those lines are the eigenvalues of $A$. Take a look at the scale on the colorbar: the graph maxes out at 6, which is the largest eigenvalue of $A$, and its minimum is -4, which is the smallest eigenvalue of $A$.
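We can verify this picture numerically: the eigenvalues of $A$ are 6 and -4, and evaluating $g$ at the corresponding eigenvectors returns exactly those values. A quick sketch:

```python
import numpy as np

A = np.array([[1.0, 5.0],
              [5.0, 1.0]])

def g(v):
    """Rayleigh quotient of A at v."""
    return (v @ A @ v) / (v @ v)

eigenvalues, eigenvectors = np.linalg.eigh(A)    # eigh is for symmetric matrices
print(eigenvalues)                               # [-4.  6.]

for lam, v in zip(eigenvalues, eigenvectors.T):  # columns of eigenvectors are the eigenvectors
    print(lam, g(v))                             # g at each eigenvector equals its eigenvalue
```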

So, the fact that

$$\nabla g(\vec v) = \vec 0 \implies A \vec v = g(\vec v) \vec v$$

means that:

  • the vector that maximizes $g(\vec v) = \frac{\vec v^T A \vec v}{\vec v^T \vec v}$ is the eigenvector of $A$ corresponding to the largest eigenvalue of $A$, and

  • the vector that minimizes $g(\vec v)$ is the eigenvector of $A$ corresponding to the smallest eigenvalue of $A$!

How does this relate to our problem, of finding the vector that maximizes $\text{PV}(\vec v)$, which is the same vector that maximizes

$$f(\vec v) = \frac{\lVert \tilde X \vec v \rVert^2}{\lVert \vec v \rVert^2}$$

?

Here's the key: our $f(\vec v)$ is a Rayleigh quotient, where $A = \tilde X^T \tilde X$.

$$f(\vec v) = \frac{\lVert \tilde X \vec v \rVert^2}{\lVert \vec v \rVert^2} = \frac{\vec v^T {\color{#3d81f6} \tilde X^T \tilde X} \vec v}{\vec v^T \vec v}$$

$\tilde X^T \tilde X$ is a symmetric matrix, so everything we know about the Rayleigh quotient applies. That is,

$$\nabla f(\vec v) = \frac{2}{\vec v^T \vec v} \left( \tilde X^T \tilde X \vec v - f(\vec v) \vec v \right)$$

and

$$\underbrace{\nabla f(\vec v) = \vec 0 \implies \tilde X^T \tilde X \vec v = f(\vec v) \vec v}_{\text{if } \vec v \text{ is a critical point, it must be an eigenvector of } \tilde X^T \tilde X}$$

Using the same reasoning as before, the vector that maximizes $f(\vec v)$ is the eigenvector of $\tilde X^T \tilde X$ corresponding to the largest eigenvalue of $\tilde X^T \tilde X$!

What does this really mean?

[Figure: the centered scatter plot with the best direction - the line spanned by the top eigenvector of $\tilde X^T \tilde X$ - drawn through the origin]

It says that the line that maximizes the variance of the projected data - which is also the line that minimizes the mean squared orthogonal error - is spanned by the eigenvector of $\tilde X^T \tilde X$ corresponding to the largest eigenvalue of $\tilde X^T \tilde X$.
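To summarize the whole pipeline in code: center the data, form $\tilde X^T \tilde X$, and take the eigenvector with the largest eigenvalue. This is a sketch with randomly generated stand-in data (in the penguins example, `X` would hold the numerical columns); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(333, 2)) @ np.array([[30.0, 0.0],
                                          [25.0, 4.0]])   # correlated stand-in data

X_tilde = X - X.mean(axis=0)                   # 1. center the data

# 2. eigendecomposition of X~^T X~ (symmetric, so eigh applies);
#    eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(X_tilde.T @ X_tilde)
best_direction = eigenvectors[:, -1]           # eigenvector for the largest eigenvalue

def pv(v):
    """Projected variance of the centered data in the direction of unit vector v."""
    return np.mean((X_tilde @ v) ** 2)

# 3. no other unit vector should have a larger projected variance
random_unit = rng.normal(size=2)
random_unit /= np.linalg.norm(random_unit)
print(pv(best_direction) >= pv(random_unit))   # True
```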

Next, we use this best-direction result to define principal components and connect them back to the SVD.