
4.1. The Gradient Vector

The big theme in our course so far has been the three-step modeling recipe: choose a model, choose a loss function, and then minimize empirical risk (i.e. average loss) to find optimal model parameters.

In Chapter 1.4, we used calculus to find the slope $w_1^*$ and intercept $w_0^*$ that minimized mean squared error,

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

by computing $\frac{\partial R_\text{sq}}{\partial w_0}$ (the partial derivative with respect to $w_0$) and $\frac{\partial R_\text{sq}}{\partial w_1}$, setting both to zero, and solving the resulting system of equations.

Then, in Chapters 2 and 3, we focused on the multiple linear regression model and the squared loss function, which saw us minimize

$$R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \vec w \cdot \text{Aug}(\vec x_i))^2$$

where $X$ is the $n \times (d + 1)$ design matrix, $\vec y \in \mathbb{R}^n$ is the observation vector, and $\vec w \in \mathbb{R}^{d+1}$ is the parameter vector we’re trying to pick. In Chapter 2.10, we minimized $R_\text{sq}(\vec w)$ by arguing that the optimal $\vec w^*$ had to create an error vector, $\vec e = \vec y - X \vec w^*$, that was orthogonal to $\text{colsp}(X)$, which led us to the normal equations.

It turns out that there’s a way to use our calculus-based approach from Chapter 1.4 to minimize the more general version of $R_\text{sq}$ for any $d$, without computing each of the $d + 1$ partial derivatives one at a time. To see how this works, we need to define a new object, the gradient vector, which we’ll do here in Chapter 4.1. After we’re familiar with how the gradient vector works, we’ll use it to build a new approach to function minimization, one that works even when there isn’t a closed-form solution for the optimal parameters. That technique is called gradient descent, which we’ll see in Chapter 4.2.


Domain and Codomain

As we saw in Chapter 2.9 when we first introduced the concept of the inverse of a matrix, the notation

$$f: \mathbb{R}^d \to \mathbb{R}^n$$

means that $f$ is a function whose inputs are vectors with $d$ components and whose outputs are vectors with $n$ components. $\mathbb{R}^d$ is the domain of the function, and $\mathbb{R}^n$ is the codomain. I’ve used $d$ and $n$ to match the notation we’ve used for matrices and linear transformations. In general, if $A$ is an $n \times d$ matrix, then any vector $\vec x$ multiplied by $A$ (on the right) must be in $\mathbb{R}^d$ and the result $A \vec x$ will be in $\mathbb{R}^n$.
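As a quick sanity check on the shapes, here is a minimal NumPy sketch. The matrix `A` is just an illustrative $3 \times 2$ example (so $n = 3$ and $d = 2$), and the input vector `x` is an arbitrary choice.

```python
import numpy as np

# An n x d matrix with n = 3 rows and d = 2 columns (illustrative values).
A = np.array([[3.0, 4.0],
              [2.0, 0.0],
              [0.0, 1.0]])

x = np.array([1.0, -2.0])   # input vector in R^d (d = 2)
out = A @ x                 # output vector in R^n (n = 3)

print("A has shape", A.shape)      # (n, d)
print("x has shape", x.shape)      # (d,)
print("Ax has shape", out.shape)   # (n,)
print("Ax =", out)
```

The function $f(\vec x) = A \vec x$ maps $\mathbb{R}^2$ to $\mathbb{R}^3$ here, which is why the input has two components and the output has three.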

Given this framing, consider the following four types of functions.

| Type | Domain and Codomain | Examples |
| --- | --- | --- |
| Scalar-to-scalar | $f: \mathbb{R} \to \mathbb{R}$ | $f(x) = x^2 + \sin(x)$ |
| | | $R_\text{sq}(w) = \frac{1}{n} \sum_{i=1}^n (y_i - w)^2$ |
| Vector-to-scalar | $f: \mathbb{R}^d \to \mathbb{R}$ | $f(\vec x) = \vec x^T \vec x$ |
| | | $f(\vec x) = (x_1 - 1)^2 + (x_1 - x_2)^2 + 3$ |
| | | $R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$ |
| Scalar-to-vector | $f: \mathbb{R} \to \mathbb{R}^n$ | $f(x) = \underbrace{\begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} + x \begin{bmatrix} 3 \\ 0 \\ -4 \end{bmatrix}}_{\text{parametric form of a line}}$ |
| | | $f(x) = \begin{bmatrix} x^2 + e^{x} \\ -3 \end{bmatrix}$ |
| Vector-to-vector | $f: \mathbb{R}^d \to \mathbb{R}^n$ | $\underbrace{f(\vec x) = \begin{bmatrix} 3 & 4 \\ 2 & 0 \\ 0 & 1 \end{bmatrix} \vec x}_{\text{linear transformation}}$ |
| | | $f(\vec x) = \begin{bmatrix} x_1^2 + x_1x_2 + \cos(x_1^4) \\ 3x_1x_2 + 4 \\ 5x_1 \\ x_2 / x_3 \end{bmatrix}$ |

The first two types of functions are “scalar-valued”, while the latter two are “vector-valued”. These are not the only types of functions that exist; for instance, the function $f(A) = \text{rank}(A)$ is a matrix-to-scalar function.

The functions we’re most concerned with at the moment are vector-to-scalar functions, i.e. functions that take in a vector (or equivalently, multiple scalar inputs) and output a single scalar.

$$R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$

is one such function, and it’s the focus of this section.


Rates of Change

Let’s think from the perspective of rates of change, since ultimately what we’re building towards is a technique for minimizing functions. We’re most familiar with the concept of rates of change for scalar-to-scalar functions.

If

$$f(x) = x^2 \sin(x)$$

then its derivative,

$$\frac{\text{d}f}{\text{d}x} = 2x \sin(x) + x^2 \cos(x)$$

is itself a scalar-to-scalar function, which describes how quickly $f$ is changing at any point $x$ in the domain of $f$. At $x = 3$, for instance, the instantaneous rate of change is

$$\frac{\text{d}f}{\text{d}x}(3) = 2\cdot 3 \sin(3) + 3^2 \cos(3) \approx -8.06$$

meaning that at $x = 3$, $f$ is decreasing at a rate of (approximately) 8.06 per unit change in $x$. Perhaps a more intuitive way of thinking about the instantaneous rate of change is to think of it as the slope of the tangent line to $f$ at $x = 3$.

[Interactive figure: $f$ and its tangent line at $x = 3$]

The steeper the slope, the faster $f$ is changing at that point; the sign of the slope tells us whether $f$ is increasing or decreasing at that point.
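To make the slope interpretation concrete, here is a small sketch that compares the derivative formula for $f(x) = x^2 \sin(x)$ with a numerical slope estimate at $x = 3$. The helper `central_difference` and its step size `h` are illustrative choices, not something defined earlier in the course.

```python
import numpy as np

def f(x):
    return x**2 * np.sin(x)

def df(x):
    # Derivative from the product rule.
    return 2 * x * np.sin(x) + x**2 * np.cos(x)

def central_difference(func, x, h=1e-5):
    # Slope of the line through two nearby points on the graph of func.
    return (func(x + h) - func(x - h)) / (2 * h)

exact = df(3.0)
approx = central_difference(f, 3.0)

print("derivative formula at x = 3:", round(exact, 2))
print("numerical slope at x = 3:   ", round(approx, 2))
```

Both numbers should agree to several decimal places, and both are negative, which matches the claim that $f$ is decreasing at $x = 3$.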

In Chapter 1.4, we saw how to compute derivatives of functions that take in multiple scalar inputs, like

$$f(x, y, z) = x^2 + 2xy + 3xz + 4(y - z)^2$$

In the language of Chapter 4.1, we’d call such a function a vector-to-scalar function, and might use the notation

$$f(\vec x) = x_1^2 + 2x_1x_2 + 3x_1x_3 + 4(x_2 - x_3)^2$$

This function has three partial derivatives, each of which describes the instantaneous rate of change of $f$ with respect to one of its inputs, while holding the other two inputs constant. There’s a good animation of what it means to hold an input constant in Chapter 1.4 that is worth revisiting.

Here,

$$\frac{\partial f}{\partial x_1} = 2x_1 + 2x_2 + 3x_3, \quad \frac{\partial f}{\partial x_2} = 2x_1 + 8x_2 - 8x_3, \quad \frac{\partial f}{\partial x_3} = 3x_1 - 8x_2 + 8x_3$$
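One way to check partial derivatives like these is to nudge one input at a time while holding the others fixed. The sketch below does that numerically; the helper name `numerical_partials` and the test point are arbitrary choices made for illustration.

```python
import numpy as np

def f(x):
    x1, x2, x3 = x
    return x1**2 + 2*x1*x2 + 3*x1*x3 + 4*(x2 - x3)**2

def partials(x):
    # The three partial derivatives computed by hand.
    x1, x2, x3 = x
    return np.array([2*x1 + 2*x2 + 3*x3,
                     2*x1 + 8*x2 - 8*x3,
                     3*x1 - 8*x2 + 8*x3])

def numerical_partials(func, x, h=1e-6):
    # Change only input i by a small amount h, holding the others constant.
    x = np.asarray(x, dtype=float)
    result = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        result[i] = (func(x + step) - func(x - step)) / (2 * h)
    return result

point = np.array([1.0, 2.0, -1.0])
by_hand = partials(point)
by_nudging = numerical_partials(f, point)

print("by hand:   ", by_hand)
print("by nudging:", np.round(by_nudging, 4))
```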

The big idea of this section, the gradient vector, packages all of these partial derivatives into a single vector. This will allow us to think about the direction in which $f$ is changing, rather than just looking at its rates of change in each dimension independently.


The Gradient Vector

As usual, we’ll start with an example. Suppose $\vec x \in \mathbb{R}^2$, and let

$$f(\vec x) = x_1 e^{-x_1^2 - x_2^2}$$

[Interactive figure: surface plot of $f$]

The gradient of $f$, written $\nabla f(\vec x)$, is the vector whose components are the partial derivatives of $f$ with respect to each component of $\vec x$. The “input variables” to $f$ are $x_1$ and $x_2$, so to find $\nabla f(\vec x)$ we need to compute $\frac{\partial f}{\partial x_1}$ and $\frac{\partial f}{\partial x_2}$. If it makes the algebra a little cleaner, feel free to replace $x_1$ and $x_2$ with $x$ and $y$ while you work, and then switch back to $x_1$ and $x_2$ at the end.

$$f(\vec x) = x_1 e^{-x_1^2 - x_2^2}$$

$$\frac{\partial f}{\partial x_1} = \frac{\partial}{\partial x_1} \left( x_1 e^{-x_1^2 - x_2^2} \right) = \underbrace{1 \cdot e^{-x_1^2 - x_2^2} + x_1 \cdot e^{-x_1^2 - x_2^2} \cdot (-2x_1)}_{\text{product rule}} = (1 - 2x_1^2) e^{-x_1^2 - x_2^2}$$

$$\frac{\partial f}{\partial x_2} = \frac{\partial}{\partial x_2} \left( x_1 e^{-x_1^2 - x_2^2} \right) = \underbrace{x_1 \cdot e^{-x_1^2 - x_2^2} \cdot (-2x_2)}_{\text{chain rule}} = -2x_1 x_2 e^{-x_1^2 - x_2^2}$$

Putting these together, we have

$$\nabla f(\vec x) = \begin{bmatrix} (1 - 2x_1^2) e^{-x_1^2 - x_2^2} \\ -2x_1 x_2 e^{-x_1^2 - x_2^2} \end{bmatrix}$$

Remember, $\nabla f(\vec x)$ itself is a function. If we plug in a value of $\vec x$, we get a new vector back.

$$\nabla f\left(\begin{bmatrix} -1 \\ 0\end{bmatrix}\right) = \begin{bmatrix} (1 - 2(-1)^2) e^{-(-1)^2 - 0^2} \\ -2(-1)(0) e^{-(-1)^2 - 0^2} \end{bmatrix} = \begin{bmatrix} -1/e \\ 0 \end{bmatrix}$$
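Since $\nabla f$ is itself a function, it can be written as code that takes a vector in and returns a vector. The following is a minimal sketch of the gradient derived above; the name `grad_f` is an illustrative choice.

```python
import numpy as np

def f(x):
    # f(x) = x1 * exp(-x1^2 - x2^2)
    return x[0] * np.exp(-x[0]**2 - x[1]**2)

def grad_f(x):
    # Vector of the two partial derivatives computed by hand.
    e = np.exp(-x[0]**2 - x[1]**2)
    return np.array([(1 - 2 * x[0]**2) * e,
                     -2 * x[0] * x[1] * e])

g = grad_f(np.array([-1.0, 0.0]))

print("gradient at (-1, 0):", g)
print("-1/e =", -1 / np.e)
```

The first component of the printed gradient should match $-1/e$, and the second should be zero.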

What does $\nabla f\left(\begin{bmatrix} -1 \\ 0\end{bmatrix}\right) = \begin{bmatrix} -1/e \\ 0 \end{bmatrix}$ really tell us? In order to visualize it, let me introduce another way of visualizing $f$, called a contour plot.

[Interactive figure: surface plot of $f$ (left) and contour plot of $f$ (right)]

I think of the contour plot as a birds-eye view 🦅 of $f$: it’s what you see when you look at the surface from above. Notice the correspondence between the colors in both graphs.

The circle-like traces in the contour plot are called level curves; they represent slices through the surface at a constant height. On the right, the circle labeled 0.1 represents the set of points where $f(x_1, x_2) = 0.1$.

Visualizing the fact that $\nabla f\left(\begin{bmatrix} -1 \\ 0\end{bmatrix}\right) = \begin{bmatrix} -1/e \\ 0 \end{bmatrix}$ is easier to do in the contour plot, since the contour plot is 2-dimensional, like the gradient vector is. Remember that red values are high and blue values are low.

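[Interactive figure: contour plot of $f$ with the gradient at $\begin{bmatrix} -1 \\ 0 \end{bmatrix}$ drawn in gold]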

At the point $\begin{bmatrix} -1 \\ 0 \end{bmatrix}$, which is at the tail of the vector drawn in gold, $f$ is near the global minimum, meaning there are lots of directions in which we can move to increase $f$. But, the gradient vector at this point is $\begin{bmatrix} -1/e \\ 0 \end{bmatrix}$, which points in the direction of steepest ascent starting at $\begin{bmatrix} -1 \\ 0 \end{bmatrix}$. The gradient describes the “quickest way up”.

As another example, consider the fact that $\nabla f\left(\begin{bmatrix} 1.25 \\ -0.5 \end{bmatrix}\right) \approx \begin{bmatrix} -0.347 \\ 0.204 \end{bmatrix}$.

[Interactive figure: contour plot of $f$ with the gradient at $\begin{bmatrix} 1.25 \\ -0.5 \end{bmatrix}$]

Again, the gradient at $\begin{bmatrix} 1.25 \\ -0.5 \end{bmatrix}$ gives us the direction in which $f$ is increasing the quickest at that very point. If we move even a little bit in any direction (in the direction of the gradient or some other direction), the gradient will change.
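The “quickest way up” claim can be tested by brute force: take a tiny step of the same length in many different directions and see which one increases $f$ the most. This is only a rough numerical sketch; the step length `eps` and the number of directions tried are arbitrary choices.

```python
import numpy as np

def f(x):
    return x[0] * np.exp(-x[0]**2 - x[1]**2)

def grad_f(x):
    e = np.exp(-x[0]**2 - x[1]**2)
    return np.array([(1 - 2 * x[0]**2) * e, -2 * x[0] * x[1] * e])

x0 = np.array([1.25, -0.5])
g = grad_f(x0)

# 3600 unit vectors, evenly spaced around the circle.
angles = np.linspace(0, 2 * np.pi, 3600, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# How much f increases after a tiny step of length eps in each direction.
eps = 1e-4
increases = np.array([f(x0 + eps * u) - f(x0) for u in directions])

best = directions[np.argmax(increases)]   # direction with the largest increase
unit_g = g / np.linalg.norm(g)            # gradient, rescaled to length 1

print("gradient at x0:        ", np.round(g, 3))
print("best sampled direction:", np.round(best, 3))
print("unit gradient:         ", np.round(unit_g, 3))
```

The best sampled direction should be essentially the same as the gradient rescaled to unit length.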

As you might guess, to find the critical points of a function – that is, places where it is neither increasing nor decreasing – we need to find points where the gradient is zero. Hold that thought.


Examples

More typically, the functions we’ll need to take the gradient of will themselves be defined in terms of matrix and vector operations. In all of these examples, remember that we’re working with vector-to-scalar functions.

Example: Dot Product

Let $\vec a \in \mathbb{R}^n$ be some fixed vector (the equivalent of a constant in this context). Let’s find the gradient of

$$f(\vec x) = \vec a \cdot \vec x$$

I find it helpful to think about $f(\vec x)$ in its expanded form,

$$f(\vec x) = \vec a \cdot \vec x = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n$$

Remember, $\nabla f(\vec x)$ contains all of the partial derivatives of $f$, which we now need to compute.

  • What is $\frac{\partial f}{\partial x_1}$? To me, that looks like $a_1$, since the first term is $a_1 x_1$ and none of the other terms involve $x_1$.

  • Similarly, $\frac{\partial f}{\partial x_2} = a_2$.

  • In general, $\frac{\partial f}{\partial x_i} = a_i$.

Putting these together, we get

$$\nabla f(\vec x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} = \vec a$$
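A notable feature of this result is that the gradient is $\vec a$ no matter which $\vec x$ we plug in. Here is a small numerical check of that, using a randomly generated $\vec a$ and a hypothetical helper `numerical_gradient` that estimates each partial derivative with a central difference.

```python
import numpy as np

def numerical_gradient(func, x, h=1e-6):
    # Estimate each partial derivative by nudging one component at a time.
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (func(x + step) - func(x - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
a = rng.normal(size=5)            # the fixed vector a

def f(x):
    return a @ x                  # dot product of a and x

x_first = rng.normal(size=5)
x_second = rng.normal(size=5)
grad_first = numerical_gradient(f, x_first)
grad_second = numerical_gradient(f, x_second)

print("a:                  ", np.round(a, 4))
print("gradient at x_first: ", np.round(grad_first, 4))
print("gradient at x_second:", np.round(grad_second, 4))
```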

Example: Norm and Chain Rule

Here’s an extremely important example that shows up everywhere in machine learning. Find the gradients of:

  1. $f(\vec x) = \lVert \vec x \rVert^2$

  2. $f(\vec x) = \lVert \vec x \rVert$

Example: Norm to an Exponent

Find the gradient of $f(\vec x) = \lVert \vec x \rVert^p$, where $p$ is some real number.
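If you’d like to check your answers to the norm examples, one option is to compare them against a numerical gradient. The candidate formulas in the sketch below, $2 \vec x$, $\frac{\vec x}{\lVert \vec x \rVert}$, and $p \lVert \vec x \rVert^{p-2} \vec x$ (for $\vec x \neq \vec 0$), are standard results stated here as things to verify rather than worked solutions from this section; `numerical_gradient` is a hypothetical helper, and the test point and $p = 3$ are arbitrary.

```python
import numpy as np

def numerical_gradient(func, x, h=1e-6):
    # Estimate each partial derivative by nudging one component at a time.
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (func(x + step) - func(x - step)) / (2 * h)
    return grad

x = np.array([1.0, -2.0, 2.0])    # a nonzero test point with norm 3
p = 3.0
norm = np.linalg.norm(x)

# Each entry pairs a function with a candidate formula for its gradient at x.
candidates = {
    "squared norm":  (lambda v: np.linalg.norm(v)**2, 2 * x),
    "norm":          (np.linalg.norm,                 x / norm),
    "norm to the p": (lambda v: np.linalg.norm(v)**p, p * norm**(p - 2) * x),
}

matches = {}
for name, (func, formula) in candidates.items():
    matches[name] = np.allclose(numerical_gradient(func, x), formula, atol=1e-5)
    print(name, "matches numerical gradient:", matches[name])
```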

Example: Log Sum Exp

If $\vec x \in \mathbb{R}^n$, we can define the log sum exp function as

$$f(\vec x) = \log \left( \sum_{i=1}^n e^{x_i} \right)$$

What is $\nabla f(\vec x)$? (The answer is called the softmax function, and comes up all the time in machine learning, when we want our models to output predicted probabilities in a classification problem.)
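Here is a sketch that checks the softmax claim numerically. It assumes the usual definition of softmax, whose $i$-th component is $e^{x_i} / \sum_{j=1}^n e^{x_j}$; the helper `numerical_gradient` and the test vector are illustrative choices.

```python
import numpy as np

def numerical_gradient(func, x, h=1e-6):
    # Estimate each partial derivative by nudging one component at a time.
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (func(x + step) - func(x - step)) / (2 * h)
    return grad

def log_sum_exp(x):
    return np.log(np.sum(np.exp(x)))

def softmax(x):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([0.5, -1.0, 2.0])
numeric = numerical_gradient(log_sum_exp, x)
probabilities = softmax(x)

print("numerical gradient of log sum exp:", np.round(numeric, 4))
print("softmax:                          ", np.round(probabilities, 4))
print("softmax components sum to", probabilities.sum())
```

The components of the softmax output are positive and sum to 1, which is why it is a natural way to produce predicted probabilities.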

Example: Quadratic Forms

Suppose $\vec x \in \mathbb{R}^n$ and $A$ is an $n \times n$ matrix. The function

$$f(\vec x) = \vec x^T A \vec x$$

is called a quadratic form, and its gradient is given by

$$\nabla f(\vec x) = (A + A^T) \vec x$$

We won’t directly cover the proof of this formula here; one place to find it is here. Instead, we’ll focus our energy on understanding how it works, since it’s extremely important.

  1. Let $A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$. Expand out $f(\vec x) = \vec x^T A \vec x$ and compute $\nabla f(\vec x)$ directly by computing partial derivatives, and verify that the result you get matches the formula above.

  2. In quadratic forms, we typically assume that $A$ is symmetric, meaning that $A = A^T$. Why do you think this assumption is made (what does it help with)?

    • Hint: Let $A = \begin{bmatrix} 3 & 2 \\ 6 & 1 \end{bmatrix}$ and $B = \begin{bmatrix} 3 & 4 \\ 4 & 1 \end{bmatrix}$. Compute $\nabla (\vec x^T A \vec x)$ and $\nabla (\vec x^T B \vec x)$.

  3. If $A$ is any symmetric $n \times n$ matrix, what is $\nabla f(\vec x)$?

  4. Suppose $A$ is symmetric and $n \times n$, $\vec b \in \mathbb{R}^n$, and $c \in \mathbb{R}$. Find the gradient of

    $$f(\vec x) = \vec x^T A \vec x + \vec b \cdot \vec x + c$$
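The formula $\nabla f(\vec x) = (A + A^T) \vec x$ can also be explored numerically. The sketch below uses the two matrices from the hint; the test point is arbitrary and `numerical_gradient` is a hypothetical helper. It is meant as a way to check your work, not as a replacement for doing the algebra.

```python
import numpy as np

def numerical_gradient(func, x, h=1e-6):
    # Estimate each partial derivative by nudging one component at a time.
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (func(x + step) - func(x - step)) / (2 * h)
    return grad

A = np.array([[3.0, 2.0],
              [6.0, 1.0]])   # not symmetric
B = np.array([[3.0, 4.0],
              [4.0, 1.0]])   # symmetric

x = np.array([0.7, -1.3])    # arbitrary test point

grad_A_formula = (A + A.T) @ x
grad_B_formula = (B + B.T) @ x
grad_A_numeric = numerical_gradient(lambda v: v @ A @ v, x)
grad_B_numeric = numerical_gradient(lambda v: v @ B @ v, x)

print("A, formula:", grad_A_formula, " numeric:", np.round(grad_A_numeric, 4))
print("B, formula:", grad_B_formula, " numeric:", np.round(grad_B_numeric, 4))
print("same quadratic form at x?", np.isclose(x @ A @ x, x @ B @ x))
```

Comparing the two printed gradients, and the last line, is a good starting point for thinking about question 2.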

Summary of Important Gradient Rules

These are the core rules you need to know moving forward, not just because we’re about to use them in an important proof, but because they’ll come up repeatedly in your future machine learning work.

| Function | Name | Gradient |
| --- | --- | --- |
| $f(\vec x) = \vec a \cdot \vec x$ | dot product | $\nabla f(\vec x) = \vec a$ |
| $f(\vec x) = \lVert \vec x \rVert^2$ | squared norm | $\nabla f(\vec x) = 2\vec x$ |
| $f(\vec x) = \vec x^T A \vec x$ | quadratic form | $\nabla f(\vec x) = (A + A^T) \vec x$; if $A$ is symmetric, $\nabla f(\vec x) = 2A \vec x$ |

Optimization

In the calculus of scalar-to-scalar functions, we have a well-understood procedure for finding the extrema of a function. The general strategy is to take the derivative, set it to zero, and solve for the inputs (called critical points) that satisfy that condition. To be thorough, we’d perform a second derivative test to check whether each critical point is a maximum, minimum, or neither.

In the land of vector-to-scalar functions, the equivalent is to solve for where the gradient is zero, which corresponds to finding where all partial derivatives are zero. Assessing whether we’ve arrived at a maximum or minimum is more difficult to do in the vector-to-scalar case, and we will save a discussion of this for Chapter 4.2.

As an example, consider

$$f(\vec x) = \vec x^T \begin{bmatrix} 3 & 4 \\ 4 & 1 \end{bmatrix} \vec x + \begin{bmatrix} 1 \\ 2 \end{bmatrix} \cdot \vec x + 3$$

As we computed earlier, the gradient of $f(\vec x) = \vec x^T A \vec x + \vec b \cdot \vec x + c$ is $\nabla f(\vec x) = 2A \vec x + \vec b$ for symmetric $A$. So,

$$\nabla f(\vec x) = 2 \begin{bmatrix} 3 & 4 \\ 4 & 1 \end{bmatrix} \vec x + \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 6x_1 + 8x_2 + 1 \\ 8x_1 + 2x_2 + 2 \end{bmatrix}$$

To find the critical points, we set the gradient to zero and solve the resulting system. We can also accomplish this by using the inverse of $A$, if we happen to have it:

$$\nabla f(\vec x) = \vec 0 \implies 2 A \vec x + \vec b = \vec 0 \implies \vec x^* = -\frac{1}{2}A^{-1} \vec b$$

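Either way, we find that $\vec x^* = \begin{bmatrix} -7/26 \\ 1/13 \end{bmatrix}$ satisfies $\nabla f(\vec x^*) = \vec 0$. A zero gradient alone doesn’t tell us what kind of critical point we’ve found. Here, the matrix $\begin{bmatrix} 3 & 4 \\ 4 & 1 \end{bmatrix}$ has determinant $3 \cdot 1 - 4 \cdot 4 = -13 < 0$, so it has one positive and one negative eigenvalue, which means $\vec x^*$ is a saddle point rather than a local minimum or maximum.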
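The same computation can be sketched in NumPy. Rather than forming $A^{-1}$ explicitly, this version solves the linear system $2A \vec x = -\vec b$ directly. It also prints the eigenvalues of $2A$ (the matrix of second partial derivatives of $f$); eigenvalues with mixed signs indicate a saddle point.

```python
import numpy as np

A = np.array([[3.0, 4.0],
              [4.0, 1.0]])
b = np.array([1.0, 2.0])

# Setting the gradient 2Ax + b to zero means solving (2A) x = -b.
x_star = np.linalg.solve(2 * A, -b)
gradient_at_x_star = 2 * A @ x_star + b

# eigvalsh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues = np.linalg.eigvalsh(2 * A)

print("critical point:", x_star)
print("expected:      ", np.array([-7 / 26, 1 / 13]))
print("gradient there:", gradient_at_x_star)
print("eigenvalues of 2A:", eigenvalues)
```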


Minimizing Mean Squared Error

Remember, the goal of this section is to minimize mean squared error,

$$R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$

In the general case, $X$ is an $n \times (d + 1)$ matrix, $\vec y \in \mathbb{R}^n$, and $\vec w \in \mathbb{R}^{d+1}$.

We’re now equipped with the tools to minimize $R_\text{sq}(\vec w)$ by taking its gradient and setting it to zero. Hopefully, we end up with the same conditions on $\vec w^*$ that we derived in Chapter 2.10.

In the most recent example, the critical point $\vec x^*$ was a saddle point rather than a minimizer, since the matrix in that quadratic form has one positive and one negative eigenvalue. We won’t run into that issue here. $R_\text{sq}(\vec w)$ cannot output a negative number (it is the average of squared losses), so it is bounded below by 0, and, as we saw in Chapter 2.10, there is always a vector in $\text{colsp}(X)$ that is closest to $\vec y$. So, there will be some global minimizer $\vec w^*$.

Let’s start by rewriting the squared norm as a dot product and eventually matrix multiplication.

\begin{align*}
R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2 &= \frac{1}{n} (\vec y - X \vec w) \cdot (\vec y - X \vec w) \\
&= \underbrace{\frac{1}{n} (\vec y - X \vec w)^T (\vec y - X \vec w)}_{\text{since } \vec u \cdot \vec v = \vec u^T \vec v} \\
&= \frac{1}{n} \left( \vec y^T - (X \vec w)^T \right) (\vec y - X \vec w) \\
&= \frac{1}{n} \left( \vec y^T \vec y - {\color{orange}\vec y^T X \vec w} - {\color{orange}(X \vec w)^T \vec y} + (X \vec w)^T X \vec w \right)
\end{align*}

Let’s focus on the two terms in orange. They are both equal: they are both the dot product of $\vec y$ and $X \vec w$. Ideally, I want to express each term as a dot product of $\vec w$ with something, since I’m taking the gradient with respect to $\vec w$. Remember, the dot product is a scalar, and the transpose of a scalar is just that same scalar. So,

$$\vec y^T X \vec w = (\vec y^T X \vec w)^T = \vec w^T X^T \vec y = \vec w^T (X^T \vec y)$$

Performing this substitution for both orange terms gives us

\begin{align*}
R_\text{sq}(\vec w) &= \frac{1}{n} \left( \vec y^T \vec y - {\color{orange}\vec w^T (X^T \vec y)} - {\color{orange}\vec w^T X^T \vec y} + \vec w^T (X^T X) \vec w \right) \\
&= \frac{1}{n} \left( \vec y^T \vec y - 2 \vec w^T (X^T \vec y) + \vec w^T (X^T X) \vec w \right)
\end{align*}

Now, we’re ready to take the gradient, which we’ll do term by term.

  • $\nabla \left( \vec y^T \vec y \right) = \vec 0$, since $\vec y^T \vec y$ is a constant with respect to $\vec w$

  • $\nabla \left( 2 \vec w^T (X^T \vec y) \right) = 2 X^T \vec y$ using the dot product rule, since this is the dot product between $2X^T \vec y$ (a vector) and $\vec w$ (a vector)

  • $\nabla \left( \vec w^T (X^T X) \vec w \right) = 2X^T X \vec w$, using the quadratic form rule, since $X^T X$ is a symmetric matrix

Plugging these terms in gives us

\begin{align*}
R_\text{sq}(\vec w) &= \frac{1}{n} \left( \vec y^T \vec y - 2 \vec w^T (X^T \vec y) + \vec w^T (X^T X) \vec w \right) \\
\nabla R_\text{sq}(\vec w) &= \frac{1}{n} \left( \nabla \left(\vec y^T \vec y \right) - \nabla \left( 2 \vec w^T (X^T \vec y) \right) + \nabla \left( \vec w^T (X^T X) \vec w \right) \right) \\
&= \frac{1}{n} \left( \vec 0 - 2 X^T \vec y + 2X^T X \vec w \right) \\
&= \boxed{\frac{2}{n} (X^T X \vec w - X^T \vec y)}
\end{align*}
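As a sanity check on the boxed formula, the sketch below compares it with a numerical gradient on a small made-up dataset. The data is randomly generated purely for illustration, and `numerical_gradient` is a hypothetical helper.

```python
import numpy as np

def numerical_gradient(func, w, h=1e-6):
    # Estimate each partial derivative by nudging one component at a time.
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (func(w + step) - func(w - step)) / (2 * h)
    return grad

rng = np.random.default_rng(42)
n, d = 20, 3
features = rng.normal(size=(n, d))
X = np.column_stack([np.ones(n), features])   # n x (d + 1) design matrix
y = rng.normal(size=n)                        # observation vector
w = rng.normal(size=d + 1)                    # an arbitrary parameter vector

def mean_squared_error(w):
    return np.sum((y - X @ w) ** 2) / n

def mse_gradient(w):
    # The boxed formula: (2/n) (X^T X w - X^T y)
    return (2 / n) * (X.T @ X @ w - X.T @ y)

from_formula = mse_gradient(w)
from_nudging = numerical_gradient(mean_squared_error, w)

print("formula:  ", np.round(from_formula, 4))
print("numerical:", np.round(from_nudging, 4))
```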

Finally, to find the minimizer $\vec w^*$, we set the gradient to zero and solve.

$$\frac{2}{n} (X^T X \vec w^* - X^T \vec y) = \vec 0 \implies X^TX \vec w^* = X^T \vec y$$

Stop me if this feels familiar... these are the normal equations once again! It shouldn’t be a surprise that we ended up with the same conditions on $\vec w^*$ that we derived in Chapter 2.10, since we were solving the same problem.

We’ve now shown that the minimizer of

$$R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$

is given by solving $X^TX \vec w^* = X^T \vec y$. These equations have a unique solution if $X^TX$ is invertible, and infinitely many solutions otherwise. If $\vec w^*$ satisfies the normal equations, then $X \vec w^*$ is the vector in $\text{colsp}(X)$ that is closest to $\vec y$. All of that interpretation from Chapter 2.10 and Chapter 3 carries over; we’ve just introduced a new way of finding the solution.
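To see these facts in action, here is a minimal sketch on randomly generated data (made up for illustration, and assumed to give an invertible $X^TX$). It solves the normal equations directly, compares the result with NumPy’s built-in least squares routine, and checks that the gradient is zero and the error vector is orthogonal to the columns of $X$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 50, 2
features = rng.normal(size=(n, d))
X = np.column_stack([np.ones(n), features])           # n x (d + 1) design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Solve the normal equations X^T X w = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# NumPy's least squares solver, for comparison.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

gradient_at_w_star = (2 / n) * (X.T @ X @ w_star - X.T @ y)
error = y - X @ w_star

print("normal equations:", np.round(w_star, 4))
print("np.linalg.lstsq: ", np.round(w_lstsq, 4))
print("gradient at w*:  ", np.round(gradient_at_w_star, 10))
print("X^T e:           ", np.round(X.T @ error, 10))
```

Solving the system with `np.linalg.solve` instead of computing an explicit inverse is a common design choice, since it is generally more numerically stable.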

Heads up: In Homework 9, you’ll follow similar steps to minimize a new objective function that resembles $R_\text{sq}(\vec w)$ but involves another term. There, you’ll minimize

$$R_\text{ridge}(\vec w) = \lVert \vec y - X \vec w \rVert^2 + \lambda \lVert \vec w \rVert^2$$

where $\lambda > 0$ is a constant, called the regularization hyperparameter. (Notice the missing $\frac{1}{n}$.) A good way to practice what you’ve learned (and to get a head start on the homework) is to compute the gradient of $R_\text{ridge}(\vec w)$ and set it to zero. We’ll walk through the significance of $R_\text{ridge}(\vec w)$ in the homework.