
4.2. Gradient Descent

In Chapter 4.1, we learned about the gradient, the equivalent of the derivative for vector-to-scalar functions. The big takeaway was that $\nabla f(\vec x)$ describes the direction in which $f$ is increasing the quickest, at the point $\vec x$.

At the end of Chapter 4.1, we minimized

$$R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$

by computing its gradient, setting it to $\vec 0$, and solving for $\vec w^*$, which gave us another way of deriving the normal equations.

So far, most of the empirical risk functions that we've minimized had closed-form solutions – that is, formulas for the optimal parameters $w_0^*, w_1^*, \ldots$ that we could find by hand using algebra.

$$\begin{align*}
R_\text{sq}(w) &= \frac{1}{n} \sum_{i = 1}^n (y_i - w)^2 &&\implies\quad w^* = \bar{y} \\
R_\text{sq}(w_0, w_1) &= \frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2 &&\implies\quad w_1^* = r \frac{\sigma_y}{\sigma_x},\quad w_0^* = \bar{y} - w_1^* \bar{x} \\
R_\text{sq}(\vec w) &= \frac{1}{n} \lVert \vec y - X \vec w \rVert^2 &&\implies\quad \vec{w}^* = (X^TX)^{-1}X^T \vec{y}
\end{align*}$$

But, soon we'll encounter combinations of models and loss functions whose empirical risk functions have a global minimum – but one that isn't described by a formula we can write by hand. There was an example of such an empirical risk function in Homework 2, Problem 5; see if you can remember what it was!

As another example, the logistic regression model, used for predicting the probability of a binary event (e.g. whether or not a patient has a disease), makes predictions using

$$h(\vec x_i) = P(y_i = 1 \mid \vec x_i) = \frac{1}{1 + e^{- \vec w \cdot \text{Aug}(\vec x_i)}}$$

Logistic regression typically uses the cross-entropy loss function,

$$L_\text{ce}(y_i, h(\vec x_i)) = - \left( y_i \log h(\vec x_i) + (1 - y_i) \log (1 - h(\vec x_i)) \right)$$

resulting in the empirical risk function

$$R_\text{ce}(\vec w) = - \frac{1}{n} \sum_{i = 1}^n \left( y_i \log \left( \frac{1}{1 + e^{- \vec w \cdot \text{Aug}(\vec x_i)}} \right) + (1 - y_i) \log \left( 1 - \frac{1}{1 + e^{- \vec w \cdot \text{Aug}(\vec x_i)}} \right) \right)$$

(What a mess!)

$R_\text{ce}(\vec w)$ is a vector-to-scalar function, and it has a gradient, $\nabla R_\text{ce}(\vec w)$. The issue is that the solutions to $\nabla R_\text{ce}(\vec w) = \vec 0$ can't be found by hand. But, $\nabla R_\text{ce}(\vec w)$ still means something – and we can use it to estimate $\vec w^*$ without solving for it explicitly.
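To see that the gradient is still computable, here's a minimal numpy sketch of $R_\text{ce}$ and its gradient. The function names are my own, and the gradient formula used below – $\nabla R_\text{ce}(\vec w) = \frac{1}{n} X^T \left( \sigma(X \vec w) - \vec y \right)$, where $\sigma$ is the sigmoid function and the rows of $X$ are the $\text{Aug}(\vec x_i)$'s – is the standard result for cross-entropy loss, which we won't derive here.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy_risk(w, X, y):
    # Rows of X are Aug(x_i); y holds 0s and 1s
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy_gradient(w, X, y):
    # Easy to evaluate at any particular w, even though setting this
    # expression to 0 and solving for w can't be done by hand
    return X.T @ (sigmoid(X @ w) - y) / len(y)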

Throughout this section, keep the following thought exercise in mind:

Suppose you’re at the top of a mountain 🏔️ and need to get to the bottom. And, perhaps it’s really cloudy ☁️, meaning you can only see a few feet around you. How would you get to the bottom?


What do Derivatives Tell Us?

Let’s start with a simpler, scalar-to-scalar example. Suppose we’d like to minimize the function

$$f(x) = 5x^4 - x^3 - 5x^2 + 2x - 9$$
[Figure: graph of $f(x)$]

There are some not-so-elegant techniques for minimizing $f$. For instance, we could evaluate $f$ at dozens (or hundreds) of possible $x$'s, and pick the one that had the smallest output. But, that's an inefficient way to go about things, especially as the number of input variables (i.e. the $d$ in $\vec x \in \mathbb{R}^d$) increases.
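For instance, a brute-force sketch of that idea might look like this (the grid bounds and resolution are arbitrary choices on my part):

import numpy as np

def f(x):
    return 5 * x**4 - x**3 - 5 * x**2 + 2 * x - 9

# Evaluate f on a grid of candidate inputs and keep the best one.
# Fine in one dimension, but the number of grid points needed
# grows exponentially with the number of input variables, d.
candidates = np.linspace(-2, 2, 401)
best_x = candidates[np.argmin(f(candidates))]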

Instead, we'll use the fact that $f$ is differentiable. Its derivative is

$$\frac{\text{d}f}{\text{d}x}(x) = 20x^3 - 3x^2 - 10x + 2$$

Typically, to minimize $f$, we'd

  1. Find $\frac{\text{d}f}{\text{d}x}(x)$.

  2. Solve for the input $x^*$ such that $\frac{\text{d}f}{\text{d}x}(x^*) = 0$.

But, $\frac{\text{d}f}{\text{d}x}(x) = 0$ is a cubic equation, which is difficult to solve by hand (there is actually a "cubic formula", the same way there's a "quadratic formula" – and even a quartic formula – but polynomials of degree five and higher have no such formulas). So what can we do with the derivative?
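To be fair, since this particular derivative is a polynomial in a single variable, we could hand its coefficients to a numerical root finder. A quick sketch:

import numpy as np

# Coefficients of df/dx, from the highest-degree term down
critical_points = np.roots([20, -3, -10, 2])
# We'd still need to evaluate f at each real root to find the minimum

But this trick only works for single-variable polynomials. For functions like $R_\text{ce}(\vec w)$, no special-purpose solver exists, so we need a more general approach.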

The key idea is that we'll take an iterative approach. Suppose we start with an initial guess for $x^*$, say, $x^{(0)}$.

Case 1: If the derivative at $x^{(0)}$ is positive ⬆️, then $f$ is increasing at $x^{(0)}$, which means to decrease $f$, we should move to the left of $x^{(0)}$ ⬅️.

[Figure: $f$ with a positive derivative at $x^{(0)}$]

Case 2: If the derivative at $x^{(0)}$ is negative 📉, then $f$ is decreasing at $x^{(0)}$, which means to decrease $f$, we should move to the right of $x^{(0)}$ ➡️.

[Figure: $f$ with a negative derivative at $x^{(0)}$]

Remember that at a minimum (or maximum), the derivative is 0 (if it exists), and it gradually approaches 0 as we near the minimum, at least for the types of functions we'll consider. So, if the derivative at our current guess is large, we must be far away from a minimum, and should take a larger step than if the derivative is small (which must mean we're close to the minimum already).
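As a quick sanity check using the derivative we computed above: at the guess $x = 0$,

$$\frac{\text{d}f}{\text{d}x}(0) = 20(0)^3 - 3(0)^2 - 10(0) + 2 = 2 > 0$$

so $f$ is increasing at 0, and we should step left – and only modestly, since the derivative is small. Sure enough, we'll soon see that starting from $x^{(0)} = 0$, the algorithm drifts left and settles near $x^* \approx -0.727$.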

This intuition is the basic idea behind gradient descent. The fact that "gradient" is in the name implies that it's typically used for minimizing vector-to-scalar functions $f: \mathbb{R}^d \rightarrow \mathbb{R}$, which is what we'll use it for shortly. I just figured a scalar-to-scalar example would help build intuition.


Gradient Descent

Gradient descent is a numerical method for finding the input to a function ff that minimizes the function. A numerical method is a technique for approximating the solution to a mathematical problem, often by using the computer. Since it only involves first (partial) derivatives, it’s called a first-order method; numerical methods that use second (partial) derivatives are called second-order methods.

Gradient descent is the workhorse of machine learning. As I stated earlier, it’s used to minimize empirical risk, R(w)R(\vec w), in the general case, where it can’t be minimized by hand. (In the first few examples we’ll see, we’ll use gradient descent to minimize arbitrary functions of the form f(x)f(\vec x), but we’ll later return to minimizing empirical risk functions specifically.) This is especially true for state-of-the-art models, like neural networks and transformers, which have billions of parameters. So, when you hear about companies spending billions of dollars on training models, they’re spending money to run gradient descent on their oceans of data.

Example Implementation

To get a feel for how it works, let’s implement gradient descent ourselves on the scalar-to-scalar function we looked at earlier,

$$f(x) = 5x^4 - x^3 - 5x^2 + 2x - 9, \qquad \frac{\text{d}f}{\text{d}x}(x) = 20x^3 - 3x^2 - 10x + 2$$

Gradient descent starts from an initial guess $\vec x^{(0)}$ and repeatedly applies the update rule $\vec x^{(t+1)} = \vec x^{(t)} - \alpha \nabla f(\vec x^{(t)})$, where $\alpha$ is a positive constant called the learning rate, or step size. For scalar-to-scalar functions, we can replace the gradient with the derivative, and the update rule becomes

$$x^{(t+1)} = x^{(t)} - \alpha \frac{\text{d}f}{\text{d}x}(x^{(t)})$$

Let's start with an initial guess of $x^{(0)} = 0$ and a learning rate of $\alpha = 0.01$.

import numpy as np

def df(x):
    # Derivative of f; gradient descent only ever needs df, not f itself
    return 20 * (x**3) - 3 * (x**2) - 10 * x + 2

def minimize_f(x0, alpha, tol=0.0001):
    x = x0 # Initial guess
    i = 0 # Iteration counter (just for tracking)

    while np.abs(df(x)) > tol:
        # The core logic is in this one line
        x = x - alpha * df(x)

        # Everything below is for us to track the progress of the algorithm
        i += 1
        if i % 10 == 0:
            print(f'Iteration {i}: x = {x}, df/dx = {df(x)}')
    print(f"Converged in {i} iterations")

minimize_f(x0=0, alpha=0.01)
Iteration 10: x = -0.3019961066782706, df/dx = 4.195505266352445
Iteration 20: x = -0.6539833552473451, df/dx = 1.6626527284020556
Iteration 30: x = -0.7227462183452047, df/dx = 0.10967130332536357
Iteration 40: x = -0.7267760078170167, df/dx = 0.005439315124045052
Iteration 50: x = -0.7269745284952817, df/dx = 0.0002654471448240159
Converged in 54 iterations

With our initial guess of $x^{(0)} = 0$ and a learning rate of $\alpha = 0.01$, the algorithm converges to $x^* \approx -0.727$ in 54 iterations.

Animations

Let's visualize the execution of the algorithm on this $f(x)$, with several different choices of initial guess and learning rate. In each animation, click the "▶️ Start animation" button to see the algorithm in action. (If the algorithm stops somewhere that isn't a minimum, it's because I set them to only show 50 iterations.)

Initial guess = 0, step size = 0.01

[Animation]

The algorithm converges to the true global minimum, but it takes a while.

What if we choose a different initial guess?

Initial guess = 1.1, step size = 0.01

[Animation]

Uh oh: the algorithm gets trapped in a local minimum that isn’t the global minimum! From its perspective, local and global minima are the same, in that they have derivatives of 0. Gradient descent doesn’t seem well-suited for functions with multiple local minima.

What if we try a different step size?

Initial guess = 1.1, step size = 0.1

[Animation]

At first, seemingly by luck, the algorithm jumps over to the neighborhood of the global minimum. But our step size appears to be too large, causing the algorithm to keep jumping back and forth across the global minimum.

The choice of step size is critical. In future courses, you may encounter some theoretical results that give you insight on how to choose the step size, but in practice, we often just try different step sizes and see what works. Another technique is to choose a decaying learning rate, in which the value of $\alpha$ decreases over time.
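For example, here's a minimal sketch of one common decay schedule, $\alpha_t = \frac{\alpha_0}{t + 1}$ (the schedule and the iteration count are illustrative choices, not recommendations):

def minimize_f_decay(x0, alpha0, iterations=100):
    # Reuses df from the earlier cell; the step size alpha0 / (t + 1)
    # shrinks over time, so early steps explore and later steps settle in
    x = x0
    for t in range(iterations):
        x = x - (alpha0 / (t + 1)) * df(x)
    return x

By iteration 100, each step is only 1% the size of the first one, which helps avoid the kind of back-and-forth jumping we saw above.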


Vector-to-Scalar Functions

Let's look at another, more complex example,

$$f(\vec x) = 3 \sin(2 x_1) \cos(2 x_2) + x_1^2 + x_2^2$$

Note that I've chosen a function $f: \mathbb{R}^2 \rightarrow \mathbb{R}$, since the resulting function exists in three dimensions, which is the largest space we can visualize. If $f$ took in vectors with 3, or 4, or more components, we couldn't visualize how gradient descent operates on it.

That said, we'll visualize $f(\vec x)$ in two ways: as a surface and as a contour plot. (Remember, you should think of the latter as a "bird's-eye view" of the former.)

[Figure: surface plot of $f(\vec x)$]
[Figure: contour plot of $f(\vec x)$]

This is the type of function that gradient descent is often used on in practice: functions with many (remember, billions of!) input variables, often with local minima.

We can see that $f(\vec x)$ has a global minimum around $\vec x^* \approx \begin{bmatrix} -0.6 \\ 0 \end{bmatrix}$. But, computers don't have eyes, and instead need to rely on gradient descent. Recall, gradient descent updates are given by

$$\vec x^{(t+1)} = \vec x^{(t)} - \alpha \nabla f(\vec x^{(t)})$$

$f(\vec x)$'s gradient is

$$\nabla f(\vec x) = \begin{bmatrix} 6\cos(2x_1)\cos(2x_2) + 2x_1 \\ -6\sin(2x_1)\sin(2x_2) + 2x_2 \end{bmatrix}$$

so, gradient descent updates are given by

$$\vec x^{(t+1)} = \vec x^{(t)} - \alpha \begin{bmatrix} 6\cos(2x_1)\cos(2x_2) + 2x_1 \\ -6\sin(2x_1)\sin(2x_2) + 2x_2 \end{bmatrix}$$
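Here's a minimal numpy sketch of these updates (the function names and the stopping criterion – a fixed number of iterations – are my own choices):

import numpy as np

def grad_f(x):
    # Gradient of f(x) = 3 sin(2 x1) cos(2 x2) + x1^2 + x2^2
    x1, x2 = x
    return np.array([
        6 * np.cos(2 * x1) * np.cos(2 * x2) + 2 * x1,
        -6 * np.sin(2 * x1) * np.sin(2 * x2) + 2 * x2,
    ])

def gradient_descent(grad, x0, alpha, iterations=100):
    x = np.array(x0, dtype=float)
    for _ in range(iterations):
        x = x - alpha * grad(x)  # the same update rule, now with vectors
    return x

gradient_descent(grad_f, x0=[1, -0.5], alpha=0.1)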

The negative of the gradient vector is the direction we want to move in! The amount we move in this direction is determined by both $\lVert \nabla f(\vec x) \rVert$ and $\alpha$.

Let’s visualize the path of gradient descent on this function, again at several different initial guesses and learning rates.

Initial guess = $\begin{bmatrix} 1 \\ -0.5 \end{bmatrix}$, step size = 0.1

[Animation]

$\vec x^{(t)}$ converged to a local minimum, of which there are many in this function. Also, notice that a step size of $\alpha = 0.1$ was so large in the earlier polynomial example that it caused the algorithm to jump back and forth without converging, but it worked here. There is no universal best step size.

Initial guess = $\begin{bmatrix} -1.5 \\ -1 \end{bmatrix}$, step size = 0.1

[Animation]

Initial guess = $\begin{bmatrix} -1.5 \\ -1 \end{bmatrix}$, step size = 0.25

[Animation]

Once again, we chose a step size that was too large, causing the algorithm to diverge.

We can visualize the same path on the surface itself.

[Animation]

Remember not to rely too heavily on visual intuition, since practical examples will take us into higher dimensions, where we can’t visualize. But I think both the surface and contour plots are helpful.

It seems that the ability of gradient descent to converge to the global minimum depends on several factors:

  • The existence of “traps” – that is, local minima that aren’t the global minimum.

  • The step size, $\alpha$.

  • The initial guess.

If there are no local minima other than the global minimum, then gradient descent will converge to the global minimum, given a sufficiently small step size. The most important class of functions with this property is the convex functions. Intuitively, a convex function has a "bowl-like" shape, as you have seen in calculus. In Chapter 4.3, we'll study the idea of convexity in more detail.
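To make "sufficiently small" concrete, consider the convex function $f(x) = x^2$, whose derivative is $2x$. The gradient descent update is

$$x^{(t+1)} = x^{(t)} - \alpha \cdot 2x^{(t)} = (1 - 2\alpha) x^{(t)}$$

so $x^{(t)} = (1 - 2\alpha)^t x^{(0)}$, which shrinks to the global minimum $x^* = 0$ exactly when $|1 - 2\alpha| < 1$, i.e. when $0 < \alpha < 1$. With $\alpha \geq 1$ (and $x^{(0)} \neq 0$), the iterates oscillate around 0 without ever converging.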


Gradient Descent for Empirical Risk Minimization

While gradient descent can be used to (attempt to) minimize any differentiable function $f(\vec x)$, we typically use it to minimize empirical risk functions, $R(\vec w)$.

Let’s try using gradient descent to fit a linear regression model – that is, let’s use it to minimize

$$R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$

This function has a closed-form solution, but it’s worthwhile to see how gradient descent works on it.

In Chapter 4.1, we found that the gradient of $R_\text{sq}(\vec w)$ is

$$\nabla R_\text{sq}(\vec w) = \frac{2}{n} (X^TX \vec w - X^T \vec y)$$

so, the update rule is

$$\vec w^{(t+1)} = \vec w^{(t)} - \alpha \cdot \frac{2}{n} (X^TX \vec w^{(t)} - X^T \vec y)$$
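Before we do, here's a minimal sketch of this update rule in numpy, run on synthetic data (the data and all names below are made up for illustration):

import numpy as np

def gradient_descent_regression(X, y, alpha, iterations=5000):
    n, d = X.shape
    w = np.zeros(d)  # initial guess: the zero vector
    for _ in range(iterations):
        w = w - alpha * (2 / n) * (X.T @ X @ w - X.T @ y)
    return w

# Synthetic data where the true relationship is y ≈ 3 + 2x
rng = np.random.default_rng(23)
x = rng.uniform(0, 10, size=100)
X = np.column_stack([np.ones(100), x])  # augmented design matrix
y = 3 + 2 * x + rng.normal(size=100)

w_gd = gradient_descent_regression(X, y, alpha=0.01)
w_exact = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations, for comparison

With this step size, w_gd should agree with w_exact to many decimal places, since $R_\text{sq}$ is convex and the step size is small enough for convergence.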

Let’s start by using gradient descent to fit a simple linear regression model to predict commute times in minutes from departure_hour – a problem we’ve solved many times.

[Figure: commute time vs. departure_hour]

More to come! We’ll cover this example on Tuesday, and talk more about convexity (and, time permitting, variants of gradient descent for large datasets).