The big theme in our course so far has been the three-step modeling recipe: choosing a model, choosing a loss function, and then minimizing empirical risk (i.e. average loss) to find optimal model parameters.
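In Chapter 2.3, we used calculus to find the slope $w_1^*$ and intercept $w_0^*$ that minimized mean squared error,
$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n \left( y_i - (w_0 + w_1 x_i) \right)^2$$
by computing $\frac{\partial R_\text{sq}}{\partial w_0}$ (the partial derivative with respect to $w_0$) and $\frac{\partial R_\text{sq}}{\partial w_1}$, setting both to zero, and solving the resulting system of equations.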
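As a quick refresher, here is a minimal sketch of that recipe in NumPy. The data values are made up for illustration, and the names `R_sq` and `partials` are my own; the formulas for `w1_star` and `w0_star` are the standard solution to the system you get by setting both partial derivatives to zero.

```python
import numpy as np

# Hypothetical toy data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def R_sq(w0, w1):
    """Mean squared error of the line w0 + w1 * x on the toy data."""
    return np.mean((y - (w0 + w1 * x)) ** 2)

def partials(w0, w1):
    """The two partial derivatives of R_sq, found with the chain rule."""
    residuals = y - (w0 + w1 * x)
    dR_dw0 = -2 * np.mean(residuals)
    dR_dw1 = -2 * np.mean(residuals * x)
    return dR_dw0, dR_dw1

# Solving the system "both partial derivatives equal zero" gives these formulas.
w1_star = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0_star = y.mean() - w1_star * x.mean()

print(w0_star, w1_star)
print(partials(w0_star, w1_star))  # both values should be zero, up to rounding
```

Plugging any other intercept and slope into `R_sq` should give a larger value than `R_sq(w0_star, w1_star)`.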
Then, in Chapters 7.1 and 7.2, we focused on the multiple linear regression model and the squared loss function, which saw us minimize
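$$R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2 = \frac{1}{n} \sum_{i=1}^n \left( y_i - \vec w \cdot \text{Aug}(\vec x_i) \right)^2$$
where $X$ is the $n \times (d+1)$ design matrix, $\vec y \in \mathbb{R}^n$ is the observation vector, and $\vec w \in \mathbb{R}^{d+1}$ is the parameter vector we’re trying to pick. In Chapter 6.3, we minimized $R_\text{sq}(\vec w)$ by arguing that the optimal $\vec w^*$ had to create an error vector, $\vec e = \vec y - X \vec w^*$, that was orthogonal to $\text{colsp}(X)$, which led us to the normal equations.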
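To make the two forms of the risk concrete, here is a small sketch on randomly generated data. Everything here is illustrative: the data is made up, the helper `aug` assumes that Aug prepends a 1 to each feature vector (so the design matrix has a leading column of ones), and `np.linalg.lstsq` stands in for solving the normal equations by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 2
features = rng.normal(size=(n, d))  # made-up feature vectors x_1, ..., x_n
y = rng.normal(size=n)              # made-up observation vector
w = rng.normal(size=d + 1)          # an arbitrary parameter vector

def aug(x_i):
    """Assumed convention: Aug(x_i) prepends a 1 to x_i."""
    return np.concatenate(([1.0], x_i))

# Design matrix: one row Aug(x_i) per data point, shape (n, d + 1).
X = np.vstack([aug(x_i) for x_i in features])

# Form 1: (1/n) times the squared norm of y - Xw.
risk_norm = np.linalg.norm(y - X @ w) ** 2 / n

# Form 2: the average of the squared per-point errors.
risk_sum = np.mean([(y[i] - w @ aug(features[i])) ** 2 for i in range(n)])

# The least squares solution leaves an error vector orthogonal to the columns of X.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ w_star

print(np.isclose(risk_norm, risk_sum))
print(np.allclose(X.T @ e, 0))
```

The first check confirms that the norm form and the summation form agree for the same parameter vector; the second checks the orthogonality condition behind the normal equations.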
It turns out that there’s a way to use our calculus-based approach from Chapter 2.3 to minimize the more general version of $R_\text{sq}$ for any $d$, one that doesn’t involve computing $d$ partial derivatives. To see how this works, we need to define a new object, the gradient vector, which we’ll do here in Chapter 8.1. After we’re familiar with how the gradient vector works, we’ll use it to build a new approach to function minimization, one that works even when there isn’t a closed-form solution for the optimal parameters. That technique is called gradient descent, which we’ll see in Chapter 8.3.
As we saw in Chapter 6.2 when we first introduced the concept of the inverse of a matrix, the notation
$$f: \mathbb{R}^d \to \mathbb{R}^n$$
means that $f$ is a function whose inputs are vectors with $d$ components and whose outputs are vectors with $n$ components. $\mathbb{R}^d$ is the domain of the function, and $\mathbb{R}^n$ is the codomain. I’ve used $d$ and $n$ to match the notation we’ve used for matrices and linear transformations. In general, if $A$ is an $n \times d$ matrix, then any vector $\vec x$ multiplied by $A$ (on the right) must be in $\mathbb{R}^d$ and the result $A \vec x$ will be in $\mathbb{R}^n$.
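Given this framing, consider the following four types of functions: scalar-to-scalar ($f: \mathbb{R} \to \mathbb{R}$), vector-to-scalar ($f: \mathbb{R}^d \to \mathbb{R}$), scalar-to-vector ($f: \mathbb{R} \to \mathbb{R}^n$), and vector-to-vector ($f: \mathbb{R}^d \to \mathbb{R}^n$).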
The first two types of functions are “scalar-valued”, while the latter two are “vector-valued”. These are not the only types of functions that exist; for instance, the function $f(A) = \text{rank}(A)$ is a matrix-to-scalar function.
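One way to keep these categories straight is to look at the shapes of inputs and outputs in code. The sketch below uses small functions I made up for illustration (none of them come from earlier chapters), plus `np.linalg.matrix_rank` for the matrix-to-scalar case.

```python
import numpy as np

def scalar_to_scalar(x):
    """R -> R: one number in, one number out."""
    return x ** 2 * np.sin(x)

def vector_to_scalar(x):
    """R^d -> R: a vector in, one number out."""
    return np.dot(x, x)

def scalar_to_vector(t):
    """R -> R^n: one number in, a vector out."""
    return np.array([np.cos(t), np.sin(t)])

# A is 2 x 3, so multiplying by A maps vectors in R^3 to vectors in R^2.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])

def vector_to_vector(x):
    """R^d -> R^n: a vector in, a vector out."""
    return A @ x

def matrix_to_scalar(M):
    """A matrix in, one number (its rank) out."""
    return np.linalg.matrix_rank(M)

x = np.array([1.0, 1.0, 1.0])
print(np.ndim(scalar_to_scalar(3.0)), np.ndim(vector_to_scalar(x)))  # scalar outputs
print(scalar_to_vector(0.0).shape, vector_to_vector(x).shape)        # vector outputs
print(matrix_to_scalar(A))
```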
The type of function we’re most concerned with at the moment is the vector-to-scalar function, i.e. a function that takes in a vector (or equivalently, multiple scalar inputs) and outputs a single scalar.
$$R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$
is one such function, and it’s the focus of this section.
Let’s think from the perspective of rates of change, since ultimately what we’re building towards is a technique for minimizing functions. We’re most familiar with the concept of rates of change for scalar-to-scalar functions.
If
$$f(x) = x^2 \sin(x)$$
then its derivative,
$$\frac{\text{d}f}{\text{d}x} = 2x \sin(x) + x^2 \cos(x)$$
itself is a scalar-to-scalar function, which describes how quickly $f$ is changing at any point $x$ in the domain of $f$. At $x = 3$, for instance, the instantaneous rate of change is
$$\frac{\text{d}f}{\text{d}x}(3) = 2 \cdot 3 \sin(3) + 3^2 \cos(3) \approx -8.06$$
meaning that at $x = 3$, $f$ is decreasing at a rate of (approximately) 8.06 per unit change in $x$. Perhaps a more intuitive way of thinking about the instantaneous rate of change is to think of it as the slope of the tangent line to $f$ at $x = 3$.
The steeper the slope, the faster f is changing at that point; the sign of the slope tells us whether f is increasing or decreasing at that point.
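If you’d like to double-check the number above, here’s a minimal sketch that compares the derivative formula to a numerical approximation. The step size `h` and the helper name `tangent_line` are my own choices for illustration.

```python
import numpy as np

def f(x):
    return x ** 2 * np.sin(x)

def df(x):
    """Derivative of f, from the product rule."""
    return 2 * x * np.sin(x) + x ** 2 * np.cos(x)

x0 = 3.0
h = 1e-6  # small step for the numerical approximation (an arbitrary choice)

# Central difference: the slope of a very short secant line centered at x0.
central_difference = (f(x0 + h) - f(x0 - h)) / (2 * h)

def tangent_line(x):
    """The line through (x0, f(x0)) whose slope is df(x0)."""
    return f(x0) + df(x0) * (x - x0)

print(round(df(x0), 2))   # -8.06
print(central_difference)
```

The two slopes should agree to several decimal places, and `tangent_line` should stay close to `f` for inputs near 3.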
In Chapter 2.3, we saw how to compute derivatives of functions that take in multiple scalar inputs, like
$$f(x, y, z) = x^2 + 2xy + 3xz + 4(y - z)^2$$
In the language of Chapter 8.1, we’d call such a function a vector-to-scalar function, and might use the notation
$$f(\vec x) = x_1^2 + 2 x_1 x_2 + 3 x_1 x_3 + 4 (x_2 - x_3)^2$$
This function has three partial derivatives, each of which describes the instantaneous rate of change of f with respect to one of its inputs, while holding the other two inputs constant. There’s a good animation of what it means to hold an input constant in Chapter 2.2 that is worth revisiting.
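Here’s a small sketch of what “holding the other two inputs constant” looks like numerically. The partial derivative formulas in `analytic_partials` are ones I worked out by hand from the definition of $f$ above, and the test point is arbitrary; `numeric_partial` nudges only one input at a time.

```python
import numpy as np

def f(x):
    x1, x2, x3 = x
    return x1 ** 2 + 2 * x1 * x2 + 3 * x1 * x3 + 4 * (x2 - x3) ** 2

def analytic_partials(x):
    """The three partial derivatives of f, computed by hand."""
    x1, x2, x3 = x
    return np.array([
        2 * x1 + 2 * x2 + 3 * x3,   # with respect to x1
        2 * x1 + 8 * (x2 - x3),     # with respect to x2
        3 * x1 - 8 * (x2 - x3),     # with respect to x3
    ])

def numeric_partial(f, x, i, h=1e-6):
    """Nudge only input i, keeping every other input fixed."""
    step = np.zeros_like(x)
    step[i] = h
    return (f(x + step) - f(x - step)) / (2 * h)

x = np.array([1.0, 2.0, -1.0])  # an arbitrary test point
numeric = np.array([numeric_partial(f, x, i) for i in range(3)])

print(analytic_partials(x))
print(numeric)
```

The two printed arrays should match up to small numerical error.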
The big idea of this section, the gradient vector, packages all of these partial derivatives into a single vector. This will allow us to think about the direction in which f is changing, rather than just looking at its rates of change in each dimension independently.
As usual, we’ll start with an example. Suppose $\vec x \in \mathbb{R}^2$, and let
$$f(\vec x) = x_1 e^{-x_1^2 - x_2^2}$$
import numpy as np
from new_grad_utils import make_3D_surface

# Plot the surface of f(x₁, x₂) = x₁ e^(-x₁² - x₂²).
make_3D_surface(
    f = lambda x, y: x * np.e ** (-x ** 2 - y ** 2),
    lim = 2,
    xaxis_title = 'x₁',
    yaxis_title = 'x₂',
    zaxis_title = 'f(x₁, x₂)',
    title='',
)
To find $\nabla f(\vec x)$, we need to compute the partial derivatives of $f$ with respect to each component of $\vec x$. The “input variables” to $f$ are $x_1$ and $x_2$, so we need to compute $\frac{\partial f}{\partial x_1}$ and $\frac{\partial f}{\partial x_2}$. If it makes the algebra cleaner, feel free to replace $x_1$ and $x_2$ with $x$ and $y$ while you work, and then swap them back at the end.
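Here’s a minimal sketch of that computation. The two partial derivatives in `grad_f` come from the product and chain rules (the steps are in the docstring), and they match the `dfx1` and `dfx2` lambdas used in the plotting code later in this section. The helper `numeric_grad` is my own addition, included only as a sanity check.

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1 * np.exp(-x1 ** 2 - x2 ** 2)

def grad_f(x):
    """Gradient of f: the two partial derivatives stacked into one vector.

    Write E = e^(-x1^2 - x2^2).
    Product rule in x1:  df/dx1 = E + x1 * (-2 x1) * E = (1 - 2 x1^2) * E
    Chain rule in x2:    df/dx2 = x1 * (-2 x2) * E     = -2 x1 x2 * E
    """
    x1, x2 = x
    E = np.exp(-x1 ** 2 - x2 ** 2)
    return np.array([(1 - 2 * x1 ** 2) * E, -2 * x1 * x2 * E])

def numeric_grad(f, x, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

point = np.array([-1.0, 0.0])
print(grad_f(point))           # first component is -1/e, second is 0
print(numeric_grad(f, point))  # should agree up to small numerical error
```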
What does $\nabla f\left( \begin{bmatrix} -1 \\ 0 \end{bmatrix} \right) = \begin{bmatrix} -1/e \\ 0 \end{bmatrix}$ really tell us? In order to visualize it, let me introduce another way of visualizing $f$, called a contour plot.
from new_grad_utils import make_3D_contour

fig = make_3D_contour(
    f = lambda x, y: x * np.e ** (-x ** 2 - y ** 2),
    lim = 2,
    xaxis_title = 'x₁',
    yaxis_title = 'x₂',
    title=r'$$\text{Contour plot of } f(\vec x) = x_1 e^{-x_1^2 - x_2^2}$$',
)
fig.update_layout(width=700, height=700)
I think of the contour plot as a birds-eye view 🦅 of $f$: it’s what you see when you look at the surface from above. Notice the correspondence between the colors in both graphs.
The circle-like traces in the contour plot are called level curves; they represent slices through the surface at a constant height. On the right, the circle labeled 0.1 represents the set of points where $f(x_1, x_2) = 0.1$.
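To see what “constant height” means, here’s a short sketch that generates points on the level curve at height 0.1 and checks that $f$ really does equal 0.1 at each of them. Solving $x_1 e^{-x_1^2 - x_2^2} = c$ for $x_2$ gives $x_2^2 = \ln(x_1 / c) - x_1^2$; the range of $x_1$ values below is one I picked so that the right-hand side stays positive.

```python
import numpy as np

def f(x1, x2):
    return x1 * np.exp(-x1 ** 2 - x2 ** 2)

c = 0.1  # the height of the level curve

# A few x1 values for which the level curve exists (chosen by hand).
x1 = np.linspace(0.2, 1.4, 7)

# Solve x1 * exp(-x1^2 - x2^2) = c for x2^2.
x2_squared = np.log(x1 / c) - x1 ** 2
x2 = np.sqrt(x2_squared)

# Every point (x1, x2) and its mirror image (x1, -x2) sits at height c.
print(f(x1, x2))
print(f(x1, -x2))
```

Both printed arrays should contain 0.1 in every position, up to rounding.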
Visualizing the fact that $\nabla f\left( \begin{bmatrix} -1 \\ 0 \end{bmatrix} \right) = \begin{bmatrix} -1/e \\ 0 \end{bmatrix}$ is easier to do in the contour plot, since the contour plot is 2-dimensional, like the gradient vector is. Remember that red values are high and blue values are low.
import numpy as np

def plot_gradient_on_contour(f, dfx1, dfx2, point, lim=2, **kwargs):
    """Draw the contour plot of f with the gradient vector at `point`."""
    # make_3D_contour does not accept 'arrow_len', so drop it before forwarding.
    filtered_kwargs = {k: v for k, v in kwargs.items() if k != 'arrow_len'}
    return make_3D_contour(
        f=f,
        dfx1=dfx1,
        dfx2=dfx2,
        lim=lim,
        with_gradient=True,
        grad_point=point,
        **filtered_kwargs
    )
def plot_gradient_on_surface(f, dfx1, dfx2, point, lim=2, **kwargs):
    """
    Plot 3D surface of f, mark the given point, and plot the gradient vector
    as a dotted arrow (dotted line with a cone arrowhead) coming out of that point.

    Args:
        f: function of two variables (x, y)
        dfx1: function, df/dx1(x, y)
        dfx2: function, df/dx2(x, y)
        point: tuple/list/np.array of shape (2,) representing the location (x0, y0)
        lim: limits for the plot axes
        **kwargs: keyword args for customization

    Returns:
        Plotly figure.
    """
    fig = make_3D_surface(
        f=f,
        lim=lim,
        xaxis_title=kwargs.get('xaxis_title', 'x₁'),
        yaxis_title=kwargs.get('yaxis_title', 'x₂'),
        zaxis_title=kwargs.get('zaxis_title', 'f(x₁, x₂)'),
        title=kwargs.get('surface_title', '3D Surface with Gradient Arrow')
    )
    x0, y0 = point
    z0 = f(x0, y0)

    # Compute the gradient vector at the given point
    dx = dfx1(x0, y0)
    dy = dfx2(x0, y0)

    # Draw the starting (base) point as a marker
    fig.add_scatter3d(
        x=[x0], y=[y0], z=[z0],
        mode='markers',
        marker=dict(size=8, color='gold'),
        showlegend=False
    )

    # Scale factor for the arrow; 1 draws the gradient at its actual length
    arrow_len = 1
    arrow_color = kwargs.get('arrow_color', 'gold')

    # Horizontal components of the arrow: the gradient scaled by arrow_len
    ax = dx * arrow_len
    ay = dy * arrow_len

    # Approximate change in z using the tangent plane: dz = gradient dot (Δx, Δy)
    dz = dx * ax + dy * ay

    # Draw a dotted line for the "arrow body"
    fig.add_trace(dict(
        type='scatter3d',
        x=[x0, x0 + ax],
        y=[y0, y0 + ay],
        z=[z0, z0 + dz],
        mode='lines',
        line=dict(color=arrow_color, width=5, dash='dot'),
        showlegend=False
    ))

    # Draw a cone (arrowhead) at the tip
    fig.add_trace(
        dict(
            type="cone",
            x=[x0 + ax],
            y=[y0 + ay],
            z=[z0 + dz],
            u=[ax],
            v=[ay],
            w=[dz],
            showscale=False,
            colorscale=[[0, arrow_color], [1, arrow_color]],
            sizemode="absolute",
            sizeref=0.18 * arrow_len,
            anchor="tip"
        )
    )

    # Plot a small marker at the tip of the arrow for emphasis
    fig.add_scatter3d(
        x=[x0 + ax], y=[y0 + ay], z=[z0 + dz],
        mode='markers',
        marker=dict(size=2, color=arrow_color, symbol='diamond'),
        showlegend=False
    )
    return fig
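def plot_gradient_side_by_side(f, dfx1, dfx2, point, lim=2, **kwargs):
    """Show the surface and the contour plot next to each other, with the gradient at `point`."""
    # NOTE: show_surface_and_contour_side_by_side is not defined or imported in this
    # section; it is assumed to be available alongside the other plotting helpers.
    return show_surface_and_contour_side_by_side(
        f=f,
        dfx1=dfx1,
        dfx2=dfx2,
        lim=lim,
        with_gradient=True,
        grad_point=point,
        **kwargs
    )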
# Pass the function, its two partial derivatives, and the point at which to draw the gradient.
fig = plot_gradient_on_contour(
    f=lambda x, y: x * np.e ** (-x ** 2 - y ** 2),
    dfx1=lambda x, y: (1 - 2 * x ** 2) * np.e ** (-x ** 2 - y ** 2),
    dfx2=lambda x, y: -2 * x * y * np.e ** (-x ** 2 - y ** 2),
    point=(-1, 0),
    xaxis_title='x₁',
    yaxis_title='x₂',
)
# Use the same width and height so the figure is square.
fig.update_layout(
    width=700,
    height=700,
)
fig
At the point $\begin{bmatrix} -1 \\ 0 \end{bmatrix}$, which is at the tail of the vector drawn in gold, $f$ is near the global minimum, meaning there are lots of directions in which we can move to increase $f$. But the gradient vector at this point is $\begin{bmatrix} -1/e \\ 0 \end{bmatrix}$, which points in the direction of steepest ascent starting at $\begin{bmatrix} -1 \\ 0 \end{bmatrix}$. The gradient describes the “quickest way up”.
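We can test the “quickest way up” claim numerically. The sketch below takes a tiny step from the point in each of 360 evenly spaced directions, records how much $f$ changes, and compares the best direction to the direction of the gradient. The step length `h` and the number of directions are arbitrary choices of mine.

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1 * np.exp(-x1 ** 2 - x2 ** 2)

def grad_f(x):
    x1, x2 = x
    E = np.exp(-x1 ** 2 - x2 ** 2)
    return np.array([(1 - 2 * x1 ** 2) * E, -2 * x1 * x2 * E])

point = np.array([-1.0, 0.0])
h = 1e-3  # small step length (an arbitrary choice)

# 360 unit vectors, one per degree around the circle.
thetas = np.linspace(0, 2 * np.pi, 360, endpoint=False)
directions = np.column_stack([np.cos(thetas), np.sin(thetas)])

# How much f changes after a step of length h in each direction.
changes = np.array([f(point + h * u) - f(point) for u in directions])

best_direction = directions[np.argmax(changes)]
gradient_direction = grad_f(point) / np.linalg.norm(grad_f(point))

print(best_direction)
print(gradient_direction)
```

Up to the one-degree resolution of the search, the direction that increases $f$ the most should line up with the gradient’s direction.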
As another example, consider the fact that $\nabla f\left( \begin{bmatrix} 1.25 \\ -0.5 \end{bmatrix} \right) \approx \begin{bmatrix} -0.347 \\ 0.204 \end{bmatrix}$.
fig = plot_gradient_on_contour(
    f=lambda x, y: x * np.e ** (-x ** 2 - y ** 2),
    dfx1=lambda x, y: (1 - 2 * x ** 2) * np.e ** (-x ** 2 - y ** 2),
    dfx2=lambda x, y: -2 * x * y * np.e ** (-x ** 2 - y ** 2),
    point=(1.25, -0.5),
    xaxis_title='x₁',
    yaxis_title='x₂',
)
fig.update_layout(
    width=700,
    height=700
)
Again, the gradient at $\begin{bmatrix} 1.25 \\ -0.5 \end{bmatrix}$ gives us the direction in which $f$ is increasing the quickest at that very point. If we move even a little bit in any direction (in the direction of the gradient or some other direction), the gradient will change.
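Here’s a quick sketch of that last point: we evaluate the gradient at the point above, take a short step in the gradient’s direction, and evaluate the gradient again. The step length of 0.1 is an arbitrary choice for illustration.

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1 * np.exp(-x1 ** 2 - x2 ** 2)

def grad_f(x):
    x1, x2 = x
    E = np.exp(-x1 ** 2 - x2 ** 2)
    return np.array([(1 - 2 * x1 ** 2) * E, -2 * x1 * x2 * E])

point = np.array([1.25, -0.5])
g = grad_f(point)

# Take a short step (length 0.1, chosen arbitrarily) in the gradient's direction.
nearby = point + 0.1 * g / np.linalg.norm(g)
g_nearby = grad_f(nearby)

print(np.round(g, 3))       # gradient at the original point
print(np.round(g_nearby, 3))  # gradient after the step: a different vector
print(f(point), f(nearby))  # f went up, as expected when following the gradient
```

Since the gradient is different at the new point, the “quickest way up” has to be recomputed after every move.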
As you might guess, to find the critical points of a function – that is, places where it is neither increasing nor decreasing – we need to find points where the gradient is zero. Hold that thought.
Next, we work through concrete examples of gradients and matrix-vector operations.