The big theme in our course so far has been the three-step modeling recipe: choose a model, choose a loss function, and minimize empirical risk (i.e. average loss) to find optimal model parameters.
In Chapter 1.4, we used calculus to find the slope and intercept that minimized mean squared error,

$$R_{\text{sq}}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n \left( y_i - (w_0 + w_1 x_i) \right)^2$$

by computing $\frac{\partial R_{\text{sq}}}{\partial w_0}$ (the partial derivative with respect to $w_0$) and $\frac{\partial R_{\text{sq}}}{\partial w_1}$ (the partial derivative with respect to $w_1$), setting both to zero, and solving the resulting system of equations.
Then, in Chapters 2 and 3, we focused on the multiple linear regression model and the squared loss function, which saw us minimize

$$R_{\text{sq}}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$

where $X$ is the design matrix, $\vec y$ is the observation vector, and $\vec w$ is the parameter vector we’re trying to pick. In Chapter 2.10, we minimized $R_{\text{sq}}(\vec w)$ by arguing that the optimal $\vec w^*$ had to create an error vector, $\vec e = \vec y - X \vec w^*$, that was orthogonal to the column space of $X$, which led us to the normal equations, $X^TX \vec w^* = X^T \vec y$.
It turns out that there’s a way to use our calculus-based approach from Chapter 1.4 to minimize the more general version of $R_{\text{sq}}(\vec w)$ for any number of features $d$, one that doesn’t involve computing each partial derivative separately. To see how this works, we need to define a new object, the gradient vector, which we’ll do here in Chapter 4.1. Once we’re familiar with how the gradient vector works, we’ll use it to build a new approach to function minimization, one that works even when there isn’t a closed-form solution for the optimal parameters: that technique is called gradient descent, which we’ll see in Chapter 4.2.
Domain and Codomain
As we saw in Chapter 2.9 when we first introduced the concept of the inverse of a matrix, the notation

$$f: \mathbb{R}^n \to \mathbb{R}^m$$

means that $f$ is a function whose inputs are vectors with $n$ components and whose outputs are vectors with $m$ components. $\mathbb{R}^n$ is the domain of the function, and $\mathbb{R}^m$ is the codomain. I’ve used $n$ and $m$ to match the notation we’ve used for matrices and linear transformations. In general, if $A$ is an $m \times n$ matrix, then any vector multiplied by $A$ (on the right) must be in $\mathbb{R}^n$ and the result will be in $\mathbb{R}^m$.
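To make the correspondence with matrices concrete, here’s a quick sketch in Python (assuming NumPy; the specific matrix and vector are arbitrary):

```python
import numpy as np

# A is a 2 x 3 matrix, so it maps vectors in R^3 to vectors in R^2.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])

v = np.ones(3)       # v is in R^3, the domain
result = A @ v       # A v is in R^2, the codomain

print(v.shape, result.shape)   # (3,) (2,)
```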
Given this framing, consider the following four types of functions.
| Type | Domain and Codomain | Examples |
|---|---|---|
| Scalar-to-scalar | $f: \mathbb{R} \to \mathbb{R}$ | $f(x) = x^2$ |
| Vector-to-scalar | $f: \mathbb{R}^n \to \mathbb{R}$ | $f(\vec x) = x_1 + x_2 + \cdots + x_n$ |
| Scalar-to-vector | $f: \mathbb{R} \to \mathbb{R}^m$ | $f(t) = \begin{bmatrix} \cos t \\ \sin t \end{bmatrix}$ |
| Vector-to-vector | $f: \mathbb{R}^n \to \mathbb{R}^m$ | $f(\vec x) = A \vec x$, for an $m \times n$ matrix $A$ |
The first two types of functions are “scalar-valued”, while the latter two are “vector-valued”. These are not the only types of functions that exist; for instance, the determinant, $f(A) = \det(A)$, is a matrix-to-scalar function.
The type of function we’re most concerned with at the moment is the vector-to-scalar function, i.e. a function that takes in a vector (or equivalently, multiple scalar inputs) and outputs a single scalar.

$$R_{\text{sq}}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$

is one such function, and it’s the focus of this section.
Rates of Change
Let’s think from the perspective of rates of change, since ultimately what we’re building towards is a technique for minimizing functions. We’re most familiar with the concept of rates of change for scalar-to-scalar functions.
If

$$f(x) = x^3 - 9x$$

then its derivative,

$$f'(x) = 3x^2 - 9$$

itself is a scalar-to-scalar function, which describes how quickly $f$ is changing at any point in the domain of $f$. At $x = 0.56$, for instance, the instantaneous rate of change is

$$f'(0.56) = 3(0.56)^2 - 9 \approx -8.06$$

meaning that at $x = 0.56$, $f$ is decreasing at a rate of (approximately) 8.06 per unit change in $x$. Perhaps a more intuitive way of thinking about the instantaneous rate of change is to think of it as the slope of the tangent line to $f$ at $x = 0.56$.

The steeper the slope, the faster $f$ is changing at that point; the sign of the slope tells us whether $f$ is increasing or decreasing at that point.
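If you ever want to double-check a derivative like this, a quick numerical approximation works well. Here’s a minimal sketch in Python, using the example function above:

```python
def f(x):
    return x**3 - 9 * x

def f_prime(x):
    # The derivative we computed by hand.
    return 3 * x**2 - 9

# Central-difference approximation of the derivative at x = 0.56.
x, h = 0.56, 1e-6
approx = (f(x + h) - f(x - h)) / (2 * h)

print(f_prime(x))   # -8.0592
print(approx)       # approximately the same value
```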
In Chapter 1.4, we saw how to compute derivatives of functions that take in multiple scalar inputs, like

$$f(x_1, x_2, x_3) = x_1^2 + x_1 x_2 + \sin(x_3)$$

In the language of Chapter 4.1, we’d call such a function a vector-to-scalar function, and might use the notation

$$f: \mathbb{R}^3 \to \mathbb{R}$$

This function has three partial derivatives, each of which describes the instantaneous rate of change of $f$ with respect to one of its inputs, while holding the other two inputs constant. There’s a good animation of what it means to hold an input constant in Chapter 1.4 that is worth revisiting.

Here,

$$\frac{\partial f}{\partial x_1} = 2x_1 + x_2, \qquad \frac{\partial f}{\partial x_2} = x_1, \qquad \frac{\partial f}{\partial x_3} = \cos(x_3)$$
The big idea of this section, the gradient vector, packages all of these partial derivatives into a single vector:

$$\nabla f(\vec x) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix}$$

This will allow us to think about the direction in which $f$ is changing, rather than just looking at its rates of change in each dimension independently.
To make this concrete, consider the function

$$f(\vec x) = x_1^2 + x_2^2 - x_1 x_2$$

To find $\nabla f(\vec x)$, we need to compute the partial derivatives of $f$ with respect to each component of $\vec x$. The “input variables” to $f$ are $x_1$ and $x_2$, so we need to compute $\frac{\partial f}{\partial x_1}$ and $\frac{\partial f}{\partial x_2}$, but if you’d like, replace $x_1$ and $x_2$ with $x$ and $y$ if it makes the algebra a little cleaner, and then replace $x$ and $y$ with $x_1$ and $x_2$ at the end.

$$\frac{\partial f}{\partial x_1} = 2x_1 - x_2, \qquad \frac{\partial f}{\partial x_2} = 2x_2 - x_1$$

Putting these together, we have

$$\nabla f(\vec x) = \begin{bmatrix} 2x_1 - x_2 \\ 2x_2 - x_1 \end{bmatrix}$$
Remember, $\nabla f$ itself is a function. If we plug in a value of $\vec x$, we get a new vector back.
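Since $\nabla f$ is a function, we can evaluate it anywhere we like. Here’s a minimal sketch in Python (assuming NumPy) that evaluates the hand-computed gradient of the example function above at one point, and double-checks it against a finite-difference approximation:

```python
import numpy as np

def f(x):
    # The example function above: f(x) = x1^2 + x2^2 - x1 x2.
    return x[0]**2 + x[1]**2 - x[0] * x[1]

def grad_f(x):
    # The gradient we computed by hand.
    return np.array([2 * x[0] - x[1], 2 * x[1] - x[0]])

def numerical_gradient(f, x, h=1e-6):
    # Approximate each partial derivative with a central difference.
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

x = np.array([1.0, 2.0])
print(grad_f(x))                   # [0. 3.]
print(numerical_gradient(f, x))    # approximately [0. 3.]
```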
What does $\nabla f(\vec x)$ really tell us? In order to visualize it, let me introduce another way of visualizing $f$, called a contour plot.
I think of the contour plot as a bird’s-eye view 🦅 of $f$, i.e. what you see when you look at the surface from above. Notice the correspondence between the colors in both graphs.
The circle-like traces in the contour plot are called level curves; they represent slices through the surface at a constant height. On the right, the circle labeled 0.1 represents the set of points where $f(\vec x) = 0.1$.
Visualizing the gradient vector is easier to do in the contour plot, since the contour plot is 2-dimensional, like the gradient vector is. Remember that red values are high and blue values are low.
At the point $\begin{bmatrix} 0.25 \\ 0.25 \end{bmatrix}$, which is at the tail of the vector drawn in gold, $f$ is near the global minimum, meaning there are lots of directions in which we can move to increase $f$. But, the gradient vector at this point, $\nabla f \left( \begin{bmatrix} 0.25 \\ 0.25 \end{bmatrix} \right) = \begin{bmatrix} 0.25 \\ 0.25 \end{bmatrix}$, points in the direction of steepest ascent starting at $\begin{bmatrix} 0.25 \\ 0.25 \end{bmatrix}$. The gradient describes the “quickest way up”.
As another example, consider the fact that

$$\nabla f \left( \begin{bmatrix} 1 \\ 2 \end{bmatrix} \right) = \begin{bmatrix} 2(1) - 2 \\ 2(2) - 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \end{bmatrix}$$
Again, the gradient at $\begin{bmatrix} 1 \\ 2 \end{bmatrix}$ gives us the direction in which $f$ is increasing the quickest at that very point. If we move even a little bit in any direction (in the direction of the gradient or some other direction), the gradient will change.
As you might guess, to find the critical points of a function – that is, places where it is neither increasing nor decreasing – we need to find points where the gradient is zero, i.e. where $\nabla f(\vec x) = \vec 0$. Hold that thought.
Examples
More typically, the functions we’ll need to take the gradient of will themselves be defined in terms of matrix and vector operations. In all of these examples, remember that we’re working with vector-to-scalar functions.
Example: Dot Product
Let $\vec a \in \mathbb{R}^n$ be some fixed vector (the equivalent of a constant in this context). Let’s find the gradient of

$$f(\vec x) = \vec a \cdot \vec x = \vec a^T \vec x$$

I find it helpful to think about $f(\vec x)$ in its expanded form,

$$f(\vec x) = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n$$
Remember, $\nabla f(\vec x)$ contains all of the partial derivatives of $f$, which we now need to compute.
What is $\frac{\partial f}{\partial x_1}$? To me, that looks like $a_1$, since the first term is $a_1 x_1$ and none of the other terms involve $x_1$.

Similarly, $\frac{\partial f}{\partial x_2} = a_2$.

In general, $\frac{\partial f}{\partial x_i} = a_i$.
Putting these together, we get

$$\nabla f(\vec x) = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} = \vec a$$
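As a quick numerical sanity check (the choices of $\vec a$ and $\vec x$ below are arbitrary), the finite-difference gradient of $f(\vec x) = \vec a \cdot \vec x$ comes out to $\vec a$ no matter where we evaluate it:

```python
import numpy as np

a = np.array([3.0, -1.0, 2.0])   # an arbitrary fixed vector
f = lambda x: a @ x              # f(x) = a . x

x = np.array([0.5, 1.5, -2.0])   # any point works
h = 1e-6
grad = np.array([
    (f(x + h * np.eye(3)[i]) - f(x - h * np.eye(3)[i])) / (2 * h)
    for i in range(3)
])

print(grad)   # approximately [ 3. -1.  2.], i.e. the vector a
```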
Example: Norm and Chain Rule
Here’s an extremely important example that shows up everywhere in machine learning. Find the gradients of:

$$f(\vec x) = \lVert \vec x \rVert^2 \qquad \text{and} \qquad g(\vec x) = \lVert \vec x \rVert$$
Solution
As we did in the previous example, we can expand $f(\vec x) = \lVert \vec x \rVert^2$ to get

$$f(\vec x) = x_1^2 + x_2^2 + \cdots + x_n^2$$

For each $i$, $\frac{\partial f}{\partial x_i} = 2x_i$. So,

$$\nabla f(\vec x) = \begin{bmatrix} 2x_1 \\ 2x_2 \\ \vdots \\ 2x_n \end{bmatrix} = 2 \vec x$$

Think of this as the equivalent of the “power rule” for vectors.
There are two ways to find the gradient of $g(\vec x) = \lVert \vec x \rVert$: directly, or by using the chain rule. It’s not immediately obvious how the chain rule should work here, so we’ll start with the direct method and reason about how the chain rule may arise.

Direct method: Let’s start by expanding $g(\vec x)$ like we did above.

$$g(\vec x) = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

For each $i$, the (regular, scalar-to-scalar) chain rule tells us that

$$\frac{\partial g}{\partial x_i} = \frac{2x_i}{2\sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}} = \frac{x_i}{\lVert \vec x \rVert}$$

So,

$$\nabla g(\vec x) = \frac{1}{\lVert \vec x \rVert} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \frac{\vec x}{\lVert \vec x \rVert}$$
Chain rule method: Let me start by writing $g(\vec x)$ in terms of a composition of two functions,

$$g(\vec x) = h(s(\vec x))$$

where $s(\vec x) = \lVert \vec x \rVert^2$ and $h(t) = \sqrt t$. Note that $s$ is the vector-to-scalar function we found the gradient of above, and $h$ is a scalar-to-scalar function.

Then, generalizing the calculation we did with the first method, we have a “chain rule” for a function $h(s(\vec x))$ (where $h$ is scalar-to-scalar and $s$ is vector-to-scalar):

$$\nabla \left[ h(s(\vec x)) \right] = h'(s(\vec x)) \, \nabla s(\vec x)$$

Remember that $h(t) = \sqrt t$, so $h'(t) = \frac{1}{2 \sqrt t}$ and $h'(s(\vec x)) = \frac{1}{2 \sqrt{\lVert \vec x \rVert^2}} = \frac{1}{2 \lVert \vec x \rVert}$. This means

$$\nabla g(\vec x) = \frac{1}{2 \lVert \vec x \rVert} \cdot 2 \vec x = \frac{\vec x}{\lVert \vec x \rVert}$$

which is what we found earlier.
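Here’s a quick numerical check of this rule (the point is arbitrary; note that $\nabla g(\vec x)$ is always a unit vector pointing away from the origin):

```python
import numpy as np

g = lambda x: np.linalg.norm(x)           # g(x) = ||x||
grad_g = lambda x: x / np.linalg.norm(x)  # the gradient we derived

x = np.array([3.0, 4.0])                  # here, ||x|| = 5
h = 1e-6
approx = np.array([
    (g(x + h * np.eye(2)[i]) - g(x - h * np.eye(2)[i])) / (2 * h)
    for i in range(2)
])

print(grad_g(x))   # [0.6 0.8]
print(approx)      # approximately [0.6 0.8]
```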
Example: Norm to an Exponent
Find the gradient of $f(\vec x) = \lVert \vec x \rVert^p$, where $p$ is some real number.
Solution
We can treat this as a composition of two functions, $f(\vec x) = h(s(\vec x))$ with $h(t) = t^{p/2}$ and $s(\vec x) = \lVert \vec x \rVert^2$, and use the chain rule introduced in the solution to the previous example.

$h'(t) = \frac{p}{2} t^{p/2 - 1}$ and $\nabla s(\vec x) = 2 \vec x$. Putting these together yields

$$\nabla f(\vec x) = \frac{p}{2} \left( \lVert \vec x \rVert^2 \right)^{p/2 - 1} \cdot 2 \vec x = p \lVert \vec x \rVert^{p - 2} \, \vec x$$
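A quick numerical check of this rule, with the arbitrary choice $p = 3$:

```python
import numpy as np

p = 3
f = lambda x: np.linalg.norm(x) ** p                     # f(x) = ||x||^p
grad_f = lambda x: p * np.linalg.norm(x) ** (p - 2) * x  # the formula above

x = np.array([3.0, 4.0])
h = 1e-6
# Check the partial derivative with respect to x_1 by central difference.
approx = (f(x + np.array([h, 0.0])) - f(x - np.array([h, 0.0]))) / (2 * h)

print(grad_f(x)[0])   # 45.0
print(approx)         # approximately 45.0
```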
Example: Log Sum Exp
If $\vec x \in \mathbb{R}^n$, we can define the log sum exp function as

$$f(\vec x) = \log \left( \sum_{i = 1}^n e^{x_i} \right)$$

What is $\nabla f(\vec x)$? (The answer is called the softmax function, and comes up all the time in machine learning, when we want our models to output predicted probabilities in a classification problem.)
Solution
Let’s look at the partial derivatives with respect to each $x_i$.

$$\frac{\partial f}{\partial x_i} = \frac{e^{x_i}}{\sum_{j = 1}^n e^{x_j}}$$

Then,

$$\nabla f(\vec x) = \frac{1}{\sum_{j = 1}^n e^{x_j}} \begin{bmatrix} e^{x_1} \\ e^{x_2} \\ \vdots \\ e^{x_n} \end{bmatrix}$$
There isn’t really a way to simplify this expression using matrix-vector operations, so I’ll leave it as-is. As mentioned above, the gradient we’re looking at is called the softmax function. The softmax function maps $\mathbb{R}^n \to \mathbb{R}^n$, meaning it’s a vector-to-vector function.
Let’s suppose we have the vector $\vec v = \begin{bmatrix} 1 \\ 2 \\ 5 \\ 3 \end{bmatrix}$. What does passing it through the softmax function yield?

$$\text{softmax}(\vec v) = \frac{1}{e^1 + e^2 + e^5 + e^3} \begin{bmatrix} e^1 \\ e^2 \\ e^5 \\ e^3 \end{bmatrix} \approx \begin{bmatrix} 0.015 \\ 0.041 \\ 0.831 \\ 0.112 \end{bmatrix}$$

The output vector has the same number of elements as the input vector, but each element is between 0 and 1, and the sum of the elements is 1, meaning that we can interpret the outputted vector as a probability distribution. Larger values in the output correspond to larger values in the input, and almost all of the “mass” is concentrated at the maximum element of the input vector (position 2), hence the name “soft” max. (The “hard” max might be $\begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix}^T$ in this case.)
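Here’s a short sketch of the softmax function in Python (assuming NumPy), applied to the vector above. Subtracting the maximum before exponentiating is a standard stability trick; it doesn’t change the output:

```python
import numpy as np

def softmax(v):
    # Subtracting the max before exponentiating avoids numerical overflow
    # and doesn't change the result.
    exps = np.exp(v - np.max(v))
    return exps / exps.sum()

v = np.array([1.0, 2.0, 5.0, 3.0])
out = softmax(v)

print(out.round(3))   # [0.015 0.041 0.831 0.112]
print(out.sum())      # 1.0 (up to floating point), a valid probability distribution
```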
Example: Quadratic Forms
Suppose $\vec x \in \mathbb{R}^n$ and $A$ is an $n \times n$ matrix. The function

$$f(\vec x) = \vec x^T A \vec x$$

is called a quadratic form, and its gradient is given by

$$\nabla f(\vec x) = (A + A^T) \vec x$$
We won’t directly cover the proof of this formula here; one place to find it is here. Instead, we’ll focus our energy on understanding how it works, since it’s extremely important.
Let $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$. Expand out $\vec x^T A \vec x$ and compute $\nabla f(\vec x)$ directly by computing partial derivatives, and verify that the result you get matches the formula above.
In quadratic forms, we typically assume that $A$ is symmetric, meaning that $A^T = A$. Why do you think this assumption is made (what does it help with)?
Hint: Let $A_1 = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}$ and $A_2 = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$. Compute $\vec x^T A_1 \vec x$ and $\vec x^T A_2 \vec x$.
If $A$ is any symmetric matrix, what is $\nabla \left( \vec x^T A \vec x \right)$?
Suppose $A \in \mathbb{R}^{n \times n}$ is symmetric, $\vec b \in \mathbb{R}^n$, and $c \in \mathbb{R}$. Find the gradient of

$$f(\vec x) = \vec x^T A \vec x + \vec b^T \vec x + c$$
Solution
If $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, then

$$\vec x^T A \vec x = x_1^2 + 2x_1 x_2 + 3x_2 x_1 + 4x_2^2 = x_1^2 + 5x_1 x_2 + 4x_2^2$$

Then,

$$\nabla \left( \vec x^T A \vec x \right) = \begin{bmatrix} 2x_1 + 5x_2 \\ 5x_1 + 8x_2 \end{bmatrix} = \begin{bmatrix} 2 & 5 \\ 5 & 8 \end{bmatrix} \vec x$$

since $A + A^T = \begin{bmatrix} 2 & 5 \\ 5 & 8 \end{bmatrix}$, meaning the direct computation matches the formula $\nabla \left( \vec x^T A \vec x \right) = (A + A^T) \vec x$.
For a particular quadratic form, there are infinitely many choices of matrices that represent it. To illustrate, let’s look at $A_1$ and $A_2$ as provided in the hint.

$$\vec x^T A_1 \vec x = x_1^2 + 2x_1 x_2 + x_2^2 \qquad \vec x^T A_2 \vec x = x_1^2 + 2x_1 x_2 + x_2^2$$

Note that both $\vec x^T A_1 \vec x$ and $\vec x^T A_2 \vec x$ are equal to the expression $x_1^2 + 2x_1 x_2 + x_2^2$. In fact, any matrix of the form $\begin{bmatrix} 1 & a \\ b & 1 \end{bmatrix}$ where $a + b = 2$ would produce the same quadratic form.
So, to avoid this issue of having infinitely many choices of the matrix $A$, we pick the symmetric matrix $A_{\text{sym}} = \frac{1}{2}(A + A^T)$, where $A_{\text{sym}}^T = A_{\text{sym}}$. As we’re about to see, this choice of matrix simplifies the calculation of the gradient.
If $A$ is any symmetric matrix, then $A^T = A$, and $(A + A^T) \vec x = 2A \vec x$. So,

$$\nabla \left( \vec x^T A \vec x \right) = 2A \vec x$$
This is also an important rule; don’t forget it.
Think of $f(\vec x) = \vec x^T A \vec x + \vec b^T \vec x + c$ as the matrix-vector equivalent of a quadratic function, $g(x) = ax^2 + bx + c$. The derivative of $g$ is $g'(x) = 2ax + b$. Check out what the gradient of $f$ ends up being!

$$\nabla f(\vec x) = 2A \vec x + \vec b$$
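To see both versions of the quadratic form rule in action, here’s a sketch in Python (assuming NumPy) with an arbitrary non-symmetric $A$: the general $(A + A^T) \vec x$ rule matches a finite-difference approximation, and symmetrizing $A$ changes neither the quadratic form nor the gradient:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])      # deliberately not symmetric
f = lambda x: x @ A @ x         # the quadratic form x^T A x

x = np.array([1.0, 2.0])
h = 1e-6
approx = np.array([
    (f(x + h * np.eye(2)[i]) - f(x - h * np.eye(2)[i])) / (2 * h)
    for i in range(2)
])

print((A + A.T) @ x)   # [12. 21.], the general (A + A^T) x rule
print(approx)          # approximately [12. 21.]

# The symmetrized matrix (A + A^T) / 2 represents the same quadratic form,
# and the symmetric shortcut 2 A_sym x gives the same gradient.
A_sym = (A + A.T) / 2
print(np.isclose(x @ A_sym @ x, f(x)))   # True
print(2 * A_sym @ x)                     # [12. 21.]
```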
Summary of Important Gradient Rules
These are the core rules you need to know moving forward, not just because we’re about to use them in an important proof, but because they’ll come up repeatedly in your future machine learning work.
| Function | Name | Gradient |
|---|---|---|
| $f(\vec x) = \vec a \cdot \vec x = \vec a^T \vec x$ | dot product | $\nabla f(\vec x) = \vec a$ |
| $f(\vec x) = \lVert \vec x \rVert^2 = \vec x^T \vec x$ | squared norm | $\nabla f(\vec x) = 2 \vec x$ |
| $f(\vec x) = \vec x^T A \vec x$ | quadratic form | $\nabla f(\vec x) = (A + A^T) \vec x$; if $A$ is symmetric, $2A \vec x$ |
Optimization
In the calculus of scalar-to-scalar functions, we have a well-understood procedure for finding the extrema of a function. The general strategy is to take the derivative, set it to zero, and solve for the inputs (called critical points) that satisfy that condition. To be thorough, we’d perform a second derivative test to check whether each critical point is a maximum, minimum, or neither.
In the land of vector-to-scalar functions, the equivalent is to solve for where the gradient is zero, which corresponds to finding where all partial derivatives are zero. Assessing whether we’ve arrived at a maximum or minimum is more difficult to do in the vector-to-scalar case, and we will save a discussion of this for Chapter 4.2.
As an example, consider

$$f(\vec x) = \vec x^T A \vec x + \vec b^T \vec x, \qquad \text{where } A = \begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix} \text{ and } \vec b = \begin{bmatrix} -8 \\ -6 \end{bmatrix}$$

As we computed earlier, the gradient of $\vec x^T A \vec x + \vec b^T \vec x$ is $2A \vec x + \vec b$ for symmetric $A$. So,

$$\nabla f(\vec x) = 2A \vec x + \vec b = \begin{bmatrix} 6x_1 + 2x_2 - 8 \\ 2x_1 + 4x_2 - 6 \end{bmatrix}$$

To find the critical points, we set the gradient to zero and solve the resulting system. We can also accomplish this by using the inverse of $A$, if we happen to have it:

$$2A \vec x + \vec b = \vec 0 \implies \vec x = -\frac{1}{2} A^{-1} \vec b$$

Either way, we find that $\vec x^* = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ satisfies $\nabla f(\vec x^*) = \vec 0$, which corresponds to a local minimum.
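Here’s a quick numerical confirmation of this example in Python (assuming NumPy):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])      # symmetric
b = np.array([-8.0, -6.0])

# Setting the gradient 2Ax + b to zero means solving the system 2Ax = -b.
x_star = np.linalg.solve(2 * A, -b)
print(x_star)                   # [1. 1.]

# The gradient at x_star is (numerically) zero.
print(2 * A @ x_star + b)       # [0. 0.]

# Both eigenvalues of A are positive, so the critical point is a minimum.
print(np.linalg.eigvalsh(A))    # [1.38... 3.61...]
```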
Minimizing Mean Squared Error
Remember, the goal of this section is to minimize mean squared error,

$$R_{\text{sq}}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$

In the general case, $X$ is an $n \times (d + 1)$ design matrix, $\vec y \in \mathbb{R}^n$, and $\vec w \in \mathbb{R}^{d + 1}$.
We’re now equipped with the tools to minimize $R_{\text{sq}}(\vec w)$ by taking its gradient and setting it to zero. Hopefully, we end up with the same conditions on $\vec w^*$ that we derived in Chapter 2.10.
In the most recent example we saw, the optimal vector corresponded to a local minimum. We know that we won’t run into such an issue here, since $R_{\text{sq}}(\vec w)$ cannot output a negative number (it is the average of squared losses), so its outputs are bounded below by 0, meaning that there will be some global minimizer $\vec w^*$.
Let’s start by rewriting the squared norm as a dot product, and eventually matrix multiplication.

$$\begin{aligned} R_{\text{sq}}(\vec w) &= \frac{1}{n} \lVert \vec y - X \vec w \rVert^2 \\ &= \frac{1}{n} (\vec y - X \vec w) \cdot (\vec y - X \vec w) \\ &= \frac{1}{n} (\vec y - X \vec w)^T (\vec y - X \vec w) \\ &= \frac{1}{n} \left( \vec y^T \vec y - {\color{orange} \vec y^T X \vec w} - {\color{orange} (X \vec w)^T \vec y} + \vec w^T X^T X \vec w \right) \end{aligned}$$

Let’s focus on the two terms in orange. They are both equal: they are both the dot product of $\vec y$ and $X \vec w$. Ideally, I want to express each term as a dot product of $\vec w$ with something, since I’m taking the gradient with respect to $\vec w$. Remember, the dot product is a scalar, and the transpose of a scalar is just that same scalar. So,

$$\vec y^T X \vec w = \left( \vec y^T X \vec w \right)^T = \vec w^T X^T \vec y$$

so, performing this substitution in for both orange terms gives us

$$R_{\text{sq}}(\vec w) = \frac{1}{n} \left( \vec y^T \vec y - 2 \vec w^T X^T \vec y + \vec w^T X^T X \vec w \right)$$
Now, we’re ready to take the gradient, which we’ll do term by term.
- $\nabla_{\vec w} \left( \vec y^T \vec y \right) = \vec 0$, since $\vec y^T \vec y$ is a constant with respect to $\vec w$
- $\nabla_{\vec w} \left( \vec w^T X^T \vec y \right) = X^T \vec y$, using the dot product rule, since this is the dot product between $\vec w$ (a vector) and $X^T \vec y$ (a vector)
- $\nabla_{\vec w} \left( \vec w^T X^T X \vec w \right) = 2 X^T X \vec w$, using the quadratic form rule, since $X^T X$ is a symmetric matrix

Plugging these terms in gives us

$$\nabla R_{\text{sq}}(\vec w) = \frac{1}{n} \left( -2 X^T \vec y + 2 X^T X \vec w \right) = \frac{2}{n} \left( X^T X \vec w - X^T \vec y \right)$$
Finally, to find the minimizer $\vec w^*$, we set the gradient to zero and solve.

$$\frac{2}{n} \left( X^T X \vec w^* - X^T \vec y \right) = \vec 0 \implies X^T X \vec w^* = X^T \vec y$$
Stop me if this feels familiar... these are the normal equations once again! It shouldn’t be a surprise that we ended up with the same conditions on that we derived in Chapter 2.10, since we were solving the same problem.
We’ve now shown that the minimizer of

$$R_{\text{sq}}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$

is given by solving the normal equations, $X^T X \vec w^* = X^T \vec y$. These equations have a unique solution if $X^T X$ is invertible, and infinitely many solutions otherwise. If $\vec w^*$ satisfies the normal equations, then $X \vec w^*$ is the vector in $\text{colspace}(X)$ that is closest to $\vec y$. All of that interpretation from Chapter 2.10 and Chapter 3 carries over; we’ve just introduced a new way of finding the solution.
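To tie this together, here’s a sketch in Python (assuming NumPy) with randomly generated data: solving the normal equations gives the same $\vec w^*$ as a library least-squares solver, and the gradient of mean squared error at $\vec w^*$ is zero:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, d))])  # design matrix
y = rng.standard_normal(n)                                      # observations

# Solve the normal equations X^T X w* = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes ||y - Xw||^2 directly; the answers agree.
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w_star, w_lstsq))   # True

# The gradient of mean squared error at w* is (numerically) zero.
grad = (2 / n) * (X.T @ X @ w_star - X.T @ y)
print(np.allclose(grad, 0))           # True
```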
Heads up: In Homework 9, you’ll follow similar steps to minimize a new objective function that resembles $R_{\text{sq}}(\vec w)$ but involves another term. There, you’ll minimize

$$R_{\text{ridge}}(\vec w) = \lVert \vec y - X \vec w \rVert^2 + \lambda \lVert \vec w \rVert^2$$

where $\lambda$ is a constant, called the regularization hyperparameter. (Notice the missing $\frac{1}{n}$.) A good way to practice what you’ve learned (and to get a head start on the homework) is to compute the gradient of $R_{\text{ridge}}(\vec w)$ and set it to zero. We’ll walk through the significance of $\lambda$ in the homework.