More typically, the functions we’ll need to take the gradient of will themselves be defined in terms of matrix and vector operations. In all of these examples, remember that we’re working with vector-to-scalar functions.
## The Big Three Rules

### Example: Dot Product
Let $\vec{a} \in \mathbb{R}^n$ be some fixed vector (the equivalent of a constant in this context). Let's find the gradient of

$$f(\vec{x}) = \vec{a} \cdot \vec{x}$$

I find it helpful to think about $f(\vec{x})$ in its expanded form,

$$f(\vec{x}) = a_1 x_1 + a_2 x_2 + \dots + a_n x_n$$

Remember, $\nabla f(\vec{x})$ contains all of the partial derivatives of $f$, which we now need to compute.

What is $\frac{\partial f}{\partial x_1}$? To me, that looks like $a_1$, since the first term is $a_1 x_1$ and none of the other terms involve $x_1$.

Similarly, $\frac{\partial f}{\partial x_2} = a_2$.

In general, $\frac{\partial f}{\partial x_i} = a_i$.

Putting these together, we get

$$\nabla f(\vec{x}) = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} = \vec{a}$$
To visualize this, take a concrete choice, say $\vec{a} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}$. Then

$$f(\vec{x}) = 2x_1 + x_2$$

so the graph of $f$ is the plane $z = 2x_1 + x_2$ in $\mathbb{R}^3$. The two arrows below show the gradient at two different points on the plane. They point in the same direction because $\nabla f(\vec{x}) = \begin{bmatrix} 2 \\ 1 \end{bmatrix}$ everywhere, so the direction of steepest ascent doesn't change.
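One way to convince yourself of this rule is to check it numerically. Below is a quick sketch using NumPy (the specific vectors are my own illustrative choices, not from the text): a finite-difference estimate of the gradient of $f(\vec{x}) = \vec{a} \cdot \vec{x}$ should come out equal to $\vec{a}$ no matter where we evaluate it.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Estimate the gradient of a vector-to-scalar function f at x
    using centered finite differences."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

a = np.array([2.0, 1.0])   # the fixed vector (illustrative)
f = lambda x: a @ x        # f(x) = a . x

# The gradient is a at every point, regardless of where we evaluate it.
print(numerical_gradient(f, np.array([3.0, -4.0])))   # ≈ [2. 1.]
print(numerical_gradient(f, np.array([0.0, 10.0])))   # ≈ [2. 1.]
```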
### Example: Norm and Chain Rule

Here's an extremely important example that shows up everywhere in machine learning. Find the gradients of:

$$f(\vec{x}) = \lVert \vec{x} \rVert^2 \qquad \text{and} \qquad g(\vec{x}) = \lVert \vec{x} \rVert$$

Solution

Writing the squared norm as a dot product, $f(\vec{x}) = \vec{x} \cdot \vec{x} = x_1^2 + x_2^2 + \dots + x_n^2$, so $\frac{\partial f}{\partial x_i} = 2x_i$, and

$$\nabla f(\vec{x}) = 2\vec{x}$$

Since $g(\vec{x}) = \sqrt{f(\vec{x})}$, we can think of $g$ as a composition of the scalar-to-scalar function $s(t) = \sqrt{t}$, with $s'(t) = \frac{1}{2\sqrt{t}}$, and the vector-to-scalar function $f$. Differentiating through the square root gives

$$\nabla g(\vec{x}) = \frac{1}{2 \sqrt{\vec{x} \cdot \vec{x}}} \cdot 2 \vec{x} = \frac{\vec{x}}{\lVert \vec{x} \rVert}$$
### Chain Rule for Vector-to-Scalar Functions

To recap from the previous example, suppose $f(\vec{x}) = g(h(\vec{x}))$, where $g$ is a scalar-to-scalar function (meaning $g$ has a derivative) and $h$ is a vector-to-scalar function (meaning $h$ has a gradient). Then,

$$\nabla f(\vec{x}) = g'(h(\vec{x})) \, \nabla h(\vec{x})$$
### Example: Norm to an Exponent

Find the gradient of $f(\vec{x}) = \lVert \vec{x} \rVert^p$, where $p$ is some real number.
Solution
We can treat this as a composition of two functions, $g(t) = t^p$ and $h(\vec{x}) = \lVert \vec{x} \rVert$, and use the chain rule introduced in the solution to the previous example.

$g'(t) = p t^{p - 1}$ and $\nabla h(\vec{x}) = \frac{\vec{x}}{\lVert \vec{x} \rVert}$. Putting these together yields

$$\nabla f(\vec{x}) = p \lVert \vec{x} \rVert^{p - 1} \cdot \frac{\vec{x}}{\lVert \vec{x} \rVert} = p \lVert \vec{x} \rVert^{p - 2} \, \vec{x}$$
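As a sanity check, the formula $\nabla f(\vec{x}) = p \lVert \vec{x} \rVert^{p-2} \vec{x}$ can be compared against finite differences. Here's a sketch with NumPy; the test point and exponents are arbitrary choices of mine, not from the text.

```python
import numpy as np

def grad_norm_to_power(x, p):
    """Gradient of f(x) = ||x||^p via the chain rule: p ||x||^(p-2) x."""
    return p * np.linalg.norm(x) ** (p - 2) * x

def numerical_gradient(f, x, h=1e-6):
    """Centered finite-difference estimate of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

x = np.array([3.0, 4.0])   # ||x|| = 5
for p in [1, 2, 3]:
    exact = grad_norm_to_power(x, p)
    approx = numerical_gradient(lambda v: np.linalg.norm(v) ** p, x)
    assert np.allclose(exact, approx, atol=1e-4)

print(grad_norm_to_power(x, 2))   # p = 2 recovers the squared-norm rule 2x: [6. 8.]
```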
### Example: Log Sum Exp

If $\vec{x} \in \mathbb{R}^n$, we can define the log sum exp function as

$$\text{LSE}(\vec{x}) = \log \left( \sum_{i = 1}^n e^{x_i} \right)$$

What is $\nabla \text{LSE}(\vec{x})$? (The answer is called the softmax function, and comes up all the time in machine learning, when we want our models to output predicted probabilities in a classification problem.)
Solution
Let's look at the partial derivatives with respect to each $x_j$.

$$\frac{\partial}{\partial x_j} \log \left( \sum_{i = 1}^n e^{x_i} \right) = \frac{e^{x_j}}{\sum_{i = 1}^n e^{x_i}}$$

Then,

$$\nabla \text{LSE}(\vec{x}) = \frac{1}{\sum_{i = 1}^n e^{x_i}} \begin{bmatrix} e^{x_1} \\ e^{x_2} \\ \vdots \\ e^{x_n} \end{bmatrix}$$
There isn't really a way to simplify the expression using matrix-vector operations, so I'll leave it as-is. As mentioned above, the gradient we're looking at is called the softmax function. The softmax function maps $\mathbb{R}^n \to \mathbb{R}^n$, meaning it's a vector-to-vector function.
Let's suppose we have the vector $\vec{x} = \begin{bmatrix} 1 \\ 5 \\ 2 \end{bmatrix}$ (an illustrative choice). What does passing it through the softmax function yield?

$$\text{softmax}(\vec{x}) = \frac{1}{e^1 + e^5 + e^2} \begin{bmatrix} e^1 \\ e^5 \\ e^2 \end{bmatrix} \approx \begin{bmatrix} 0.017 \\ 0.936 \\ 0.047 \end{bmatrix}$$

The output vector has the same number of elements as the input vector, but each element is between 0 and 1, and the sum of the elements is 1, meaning that we can interpret the outputted vector as a probability distribution. Larger values in the output correspond to larger values in the input, and almost all of the "mass" is concentrated at the maximum element of the input vector (position 2), hence the name "soft" max. (The "hard" max might be $\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$ in this case.)
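Here's a minimal softmax implementation in NumPy, a sketch of the formula above (the input vector is my own illustrative choice). Note the standard stability trick of subtracting $\max(\vec{x})$ before exponentiating: the shift cancels in the ratio, so the output is unchanged, but it avoids overflow for large inputs.

```python
import numpy as np

def softmax(x):
    """Softmax of a vector: exp(x_i) / sum_j exp(x_j).
    Shifting by max(x) first is mathematically a no-op,
    but prevents overflow when entries of x are large."""
    shifted = np.exp(x - np.max(x))
    return shifted / shifted.sum()

x = np.array([1.0, 5.0, 2.0])    # illustrative input; largest entry in position 2
probs = softmax(x)
print(probs.round(3))            # ≈ [0.017 0.936 0.047]
print(probs.sum())               # sums to 1 (up to floating point)
```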
### Example: Quadratic Forms

Suppose $\vec{x} \in \mathbb{R}^n$ and $A$ is an $n \times n$ matrix. The function

$$f(\vec{x}) = \vec{x}^T A \vec{x}$$

is called a quadratic form, and its gradient is given by

$$\nabla f(\vec{x}) = (A + A^T) \vec{x}$$
We won’t directly cover the proof of this formula here; one place to find it is here. Instead, we’ll focus our energy on understanding how it works, since it’s extremely important.
Let $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, say. Expand out $f(\vec{x}) = \vec{x}^T A \vec{x}$ and compute $\nabla f(\vec{x})$ directly by computing partial derivatives, and verify that the result you get matches the formula above.
In quadratic forms, we typically assume that $A$ is symmetric, meaning that $A^T = A$. Why do you think this assumption is made (what does it help with)?

Hint: Let $A_1 = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$ and $A_2 = \begin{bmatrix} 1 & 3 \\ 1 & 1 \end{bmatrix}$. Compute $\vec{x}^T A_1 \vec{x}$ and $\vec{x}^T A_2 \vec{x}$.

If $A$ is any symmetric matrix, what is $\nabla (\vec{x}^T A \vec{x})$?
Suppose $A \in \mathbb{R}^{n \times n}$ is symmetric and $\vec{x} \in \mathbb{R}^n$, $\vec{b} \in \mathbb{R}^n$, and $c \in \mathbb{R}$. Find the gradient of

$$f(\vec{x}) = \vec{x}^T A \vec{x} + \vec{b} \cdot \vec{x} + c$$
Solution
If $f(\vec{x}) = \vec{x}^T A \vec{x} + \vec{b} \cdot \vec{x} + c$, then

$$\nabla f(\vec{x}) = \nabla (\vec{x}^T A \vec{x}) + \nabla (\vec{b} \cdot \vec{x}) + \nabla c = (A + A^T) \vec{x} + \vec{b} + \vec{0}$$

Then,

$$\nabla f(\vec{x}) = 2A\vec{x} + \vec{b}$$

since $A^T = A$, meaning $A + A^T = 2A$.
For a particular quadratic form, there are infinitely many choices of matrices that represent it. To illustrate, let's look at $A_1 = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$ and $A_2 = \begin{bmatrix} 1 & 3 \\ 1 & 1 \end{bmatrix}$ as provided in the hint.

$$\vec{x}^T A_1 \vec{x} = x_1^2 + 4x_1x_2 + x_2^2 \qquad \vec{x}^T A_2 \vec{x} = x_1^2 + 4x_1x_2 + x_2^2$$

Note that both $\vec{x}^T A_1 \vec{x}$ and $\vec{x}^T A_2 \vec{x}$ are equal to the expression $x_1^2 + 4x_1x_2 + x_2^2$. In fact, any matrix of the form $\begin{bmatrix} 1 & t \\ 4 - t & 1 \end{bmatrix}$, where $t$ is any real number, would produce the same quadratic form.
So, to avoid this issue of having infinitely many choices of the matrix $A$, we pick the symmetric choice, i.e. the matrix where $A^T = A$. As we're about to see, this choice of $A$ also simplifies the calculation of the gradient.
If $A$ is any symmetric matrix, then $A^T = A$, and so $A + A^T = 2A$. So,

$$\nabla (\vec{x}^T A \vec{x}) = (A + A^T) \vec{x} = 2A\vec{x}$$

This is also an important rule; don't forget it.
Think of $\vec{x}^T A \vec{x}$ as the matrix-vector equivalent of a quadratic function, $f(x) = ax^2$. The derivative of $f(x) = ax^2$ is $f'(x) = 2ax$. Check out what the gradient of $\vec{x}^T A \vec{x}$ ended up being: $2A\vec{x}$, the same pattern!
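To see the $2A\vec{x}$ rule in action numerically, here's a sketch in NumPy (the randomly generated symmetric $A$ and point $\vec{x}$ are my own choices) comparing finite differences against the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
A = (M + M.T) / 2                  # symmetrize, so A.T == A
x = rng.normal(size=3)

f = lambda v: v @ A @ v            # the quadratic form x^T A x

# Centered finite-difference estimate of the gradient at x.
h = 1e-6
approx = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(3)])

assert np.allclose(approx, 2 * A @ x, atol=1e-4)   # matches 2 A x
print(approx)
print(2 * A @ x)
```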
## Summary of the Big Three Rules
These are the main three rules you need to know moving forward, not just because we’re about to use them in an important proof, but because they’ll come up repeatedly in your future machine learning work.
| Function | Name | Gradient |
|---|---|---|
| $f(\vec{x}) = \vec{a} \cdot \vec{x}$ | dot product | $\nabla f(\vec{x}) = \vec{a}$ |
| $f(\vec{x}) = \lVert \vec{x} \rVert^2$ | squared norm | $\nabla f(\vec{x}) = 2\vec{x}$ |
| $f(\vec{x}) = \vec{x}^T A \vec{x}$ | quadratic form | if $A$ is symmetric, $\nabla f(\vec{x}) = 2A\vec{x}$ |
## Optimization
In the calculus of scalar-to-scalar functions, we have a well-understood procedure for finding the extrema of a function. The general strategy is to take the derivative, set it to zero, and solve for the inputs (called critical points) that satisfy that condition. To be thorough, we’d perform a second derivative test to check whether each critical point is a maximum, minimum, or neither.
In the land of vector-to-scalar functions, the equivalent is to solve for where the gradient is zero, which corresponds to finding where all partial derivatives are zero. Assessing whether we’ve arrived at a maximum or minimum is more difficult to do in the vector-to-scalar case, and we will save a discussion of this for Chapter 8.5.
As an example, consider

$$f(\vec{x}) = \vec{x}^T A \vec{x} + \vec{b} \cdot \vec{x} + c$$

where $A$ is symmetric. As we computed earlier, the gradient of $\vec{x}^T A \vec{x}$ is $2A\vec{x}$ for symmetric $A$. So,

$$\nabla f(\vec{x}) = 2A\vec{x} + \vec{b}$$

To find the critical points, we set the gradient to zero and solve the resulting system. We can also accomplish this by using the inverse of $A$, if we happen to have it:

$$2A\vec{x} + \vec{b} = \vec{0} \implies \vec{x} = -\frac{1}{2} A^{-1} \vec{b}$$

Either way, we find that $\vec{x}^* = -\frac{1}{2} A^{-1} \vec{b}$ satisfies $\nabla f(\vec{x}^*) = \vec{0}$, which corresponds to a local minimum.
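Concretely, here's how this computation looks in NumPy. The particular $A$, $\vec{b}$, and $c$ are my own illustrative choices (with $A$ symmetric and positive definite, so the critical point really is a minimum), not values from the text.

```python
import numpy as np

# Illustrative choices: A symmetric (and positive definite), b a vector, c a scalar.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([-4.0, 2.0])
c = 7.0

f = lambda x: x @ A @ x + b @ x + c

# Setting the gradient 2 A x + b to zero means solving 2 A x = -b.
x_star = np.linalg.solve(2 * A, -b)
print(x_star)                 # critical point: [ 1.4 -0.8]

# The gradient at x_star is (numerically) zero.
print(2 * A @ x_star + b)     # ≈ [0. 0.]
```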
## Minimizing Mean Squared Error

Remember, the goal of this section is to minimize mean squared error,

$$R(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2$$

In the general case, $X$ is an $n \times (d + 1)$ matrix, $\vec{y} \in \mathbb{R}^n$, and $\vec{w} \in \mathbb{R}^{d + 1}$.
We're now equipped with the tools to minimize $R(\vec{w})$ by taking its gradient and setting it to zero. Hopefully, we end up with the same conditions on $\vec{w}^*$ that we derived in Chapter 6.3.

In the most recent example we saw, the optimal vector corresponded to a local minimum. We know that we won't run into such an issue here, since $R(\vec{w})$ cannot output a negative number (it is the average of squared losses), so its minimum possible output is 0, meaning that there will be some global minimizer $\vec{w}^*$.
Let's start by rewriting the squared norm as a dot product, and eventually matrix multiplication.

$$\begin{aligned} R(\vec{w}) &= \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 \\ &= \frac{1}{n} (\vec{y} - X \vec{w}) \cdot (\vec{y} - X \vec{w}) \\ &= \frac{1}{n} \left( \vec{y} \cdot \vec{y} - \vec{y} \cdot (X \vec{w}) - (X \vec{w}) \cdot \vec{y} + (X \vec{w}) \cdot (X \vec{w}) \right) \end{aligned}$$

Let's focus on the two cross terms, $\vec{y} \cdot (X \vec{w})$ and $(X \vec{w}) \cdot \vec{y}$. They are both equal: each is the dot product of $\vec{y}$ and $X \vec{w}$. Ideally, I want to express each term as a dot product of $\vec{w}$ with something, since I'm taking the gradient with respect to $\vec{w}$. Remember, the dot product is a scalar, and the transpose of a scalar is just that same scalar. So,

$$\vec{y} \cdot (X \vec{w}) = \vec{y}^T X \vec{w} = \left( \vec{y}^T X \vec{w} \right)^T = \vec{w}^T X^T \vec{y} = (X^T \vec{y}) \cdot \vec{w}$$

so, performing this substitution for both cross terms gives us

$$R(\vec{w}) = \frac{1}{n} \left( \vec{y} \cdot \vec{y} - 2 (X^T \vec{y}) \cdot \vec{w} + \vec{w}^T X^T X \vec{w} \right)$$
Now, we’re ready to take the gradient, which we’ll do term by term.
- $\nabla (\vec{y} \cdot \vec{y}) = \vec{0}$, since $\vec{y} \cdot \vec{y}$ is a constant with respect to $\vec{w}$
- $\nabla \left( (X^T \vec{y}) \cdot \vec{w} \right) = X^T \vec{y}$, using the dot product rule, since this is the dot product between $X^T \vec{y}$ (a vector) and $\vec{w}$ (a vector)
- $\nabla (\vec{w}^T X^T X \vec{w}) = 2 X^T X \vec{w}$, using the quadratic form rule, since $X^T X$ is a symmetric matrix

Plugging these terms in gives us

$$\nabla R(\vec{w}) = \frac{1}{n} \left( \vec{0} - 2 X^T \vec{y} + 2 X^T X \vec{w} \right) = \frac{2}{n} \left( X^T X \vec{w} - X^T \vec{y} \right)$$
Finally, to find the minimizer $\vec{w}^*$, we set the gradient to zero and solve.

$$\frac{2}{n} \left( X^T X \vec{w}^* - X^T \vec{y} \right) = \vec{0} \implies X^T X \vec{w}^* = X^T \vec{y}$$
Stop me if this feels familiar... these are the normal equations once again! It shouldn’t be a surprise that we ended up with the same conditions on that we derived in Chapter 6.3, since we were solving the same problem.
We've now shown that the minimizer $\vec{w}^*$ of

$$R(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2$$

is given by solving $X^T X \vec{w}^* = X^T \vec{y}$. These equations have a unique solution if $X^T X$ is invertible, and infinitely many solutions otherwise. If $\vec{w}^*$ satisfies the normal equations, then $X \vec{w}^*$ is the vector in $\text{colspace}(X)$ that is closest to $\vec{y}$. All of that interpretation from Chapter 6.3 and Chapter 7 carries over; we've just introduced a new way of finding the solution.
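This result is easy to check numerically: solving the normal equations should agree with NumPy's built-in least-squares routine. A sketch, on randomly generated data of my own (not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # intercept column + d features
y = rng.normal(size=n)

# Minimizer of (1/n) ||y - X w||^2, via the normal equations X^T X w = X^T y.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same answer from numpy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(w_normal, w_lstsq)
print(w_normal)
```

Here `X.T @ X` is invertible (the random columns are linearly independent with probability 1), so `np.linalg.solve` finds the unique solution.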
Heads up: In Homework 10, you'll follow similar steps to minimize a new objective function that resembles $R(\vec{w})$ but involves another term. There, you'll minimize

$$R_{\text{ridge}}(\vec{w}) = \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \lVert \vec{w} \rVert^2$$

where $\lambda$ is a constant, called the regularization hyperparameter. (Notice the missing $\frac{1}{n}$.) A good way to practice what you've learned (and to get a head start on the homework) is to compute the gradient of $R_{\text{ridge}}(\vec{w})$ and set it to zero. We'll walk through the significance of $\lambda$ in the homework.