8.1. The Gradient Vector - EECS 245 Course Notes

The big theme in our course so far has been the three-step modeling recipe, of choosing a model, choosing a loss function, and then minimizing empirical risk (i.e. average loss) to find optimal model parameters.

In Chapter 2.3, we used calculus to find the slope $w_1^*$ and intercept $w_0^*$ that minimized mean squared error,

R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2

by computing $\frac{\partial R_\text{sq}}{\partial w_0}$ (the partial derivative with respect to $w_0$ ) and $\frac{\partial R_\text{sq}}{\partial w_1}$ , setting both to zero, and solving the resulting system of equations.

Then, in Chapters 7.1 and 7.2, we focused on the multiple linear regression model and the squared loss function, which saw us minimize

R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \vec w \cdot \text{Aug}(\vec x_i))^2

where $X$ is the $n \times (d + 1)$ design matrix, $\vec y \in \mathbb{R}^n$ is the observation vector, and $\vec w \in \mathbb{R}^{d+1}$ is the parameter vector we’re trying to pick. In Chapter 6.3, we minimized $R_\text{sq}(\vec w)$ by arguing that the optimal $\vec w^*$ had to create an error vector, $\vec e = \vec y - X \vec w^*$ , that was orthogonal to $\text{colsp}(X)$ , which led us to the normal equations.
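As a quick sanity check on the two equivalent forms of $R_\text{sq}(\vec w)$ above, here is a minimal sketch on a tiny made-up dataset. The data, the candidate parameter vector `w`, and the assumption that $\text{Aug}(\vec x_i)$ prepends a 1 to $\vec x_i$ are all hypothetical choices for illustration.

```python
import numpy as np

# Made-up data: n = 4 observations, d = 2 features (purely illustrative).
x_raw = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 5.0]])
y = np.array([3.0, 2.0, 5.0, 9.0])

# Design matrix with a leading column of ones, assuming Aug prepends a 1.
X = np.column_stack([np.ones(len(x_raw)), x_raw])  # shape (n, d + 1)

# An arbitrary (not optimal) parameter vector in R^(d + 1).
w = np.array([0.5, 1.0, 0.8])
n = len(y)

# Form 1: squared norm of the error vector, divided by n.
norm_form = np.linalg.norm(y - X @ w) ** 2 / n

# Form 2: average of the squared errors, one data point at a time.
sum_form = sum((y[i] - w @ X[i]) ** 2 for i in range(n)) / n

print(norm_form, sum_form)
```

Both numbers should agree up to floating-point error, since the squared norm of a vector is just the sum of its squared components.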

It turns out that there’s a way to use our calculus-based approach from Chapter 2.3 to minimize the more general version of $R_\text{sq}$ for any $d$ , that doesn’t involve computing $d$ partial derivatives. To see how this works, we need to define a new object, the gradient vector, which we’ll do here in Chapter 8.1. After we’re familiar with how the gradient vector works, we’ll use it to build a new approach to function minimization, one that works even when there isn’t a closed-form solution for the optimal parameters: that technique is called gradient descent, which we’ll see in Chapter 8.3.

Domain and Codomain

As we saw in Chapter 6.2 when we first introduced the concept of the inverse of a matrix, the notation

f: \mathbb{R}^d \to \mathbb{R}^n

means that $f$ is a function whose inputs are vectors with $d$ components and whose outputs are vectors with $n$ components. $\mathbb{R}^d$ is the domain of the function, and $\mathbb{R}^n$ is the codomain. I’ve used $d$ and $n$ to match the notation we’ve used for matrices and linear transformations. In general, if $A$ is an $n \times d$ matrix, then any vector $\vec x$ multiplied by $A$ (on the right) must be in $\mathbb{R}^d$ and the result $A \vec x$ will be in $\mathbb{R}^n$ .
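To make the shapes concrete, here is a small sketch in NumPy. The particular matrix and vector are arbitrary choices for illustration; the only point is that a $3 \times 2$ matrix takes inputs from $\mathbb{R}^2$ and produces outputs in $\mathbb{R}^3$.

```python
import numpy as np

# A is n x d with n = 3 and d = 2 (values chosen arbitrarily).
A = np.array([[3, 4],
              [2, 0],
              [0, 1]])

# x must have d = 2 components for A @ x to be defined.
x = np.array([1, -1])

result = A @ x          # components are -1, 2, -1
print(A.shape, x.shape, result.shape)
```

Here the domain is $\mathbb{R}^2$ and the codomain is $\mathbb{R}^3$, matching the shape of `A`.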

Given this framing, consider the following four types of functions.

Type	Domain and Codomain	Examples
Scalar-to-scalar	$f: \mathbb{R} \to \mathbb{R}$	$f(x) = x^2 + \sin(x)$ $R_\text{sq}(w) = \frac{1}{n} \sum_{i=1}^n (y_i - w)^2$
Vector-to-scalar	$f: \mathbb{R}^d \to \mathbb{R}$	$f(\vec x) = \vec x^T \vec x$ $f(\vec x) = (x_1 - 1)^2 + (x_1 - x_2)^2 + 3$ $R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$
Scalar-to-vector	$f: \mathbb{R} \to \mathbb{R}^n$	$f(x) = \underbrace{\begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} + x \begin{bmatrix} 3 \\ 0 \\ -4 \end{bmatrix}}_{\text{parametric form of a line}}$ $f(x) = \begin{bmatrix} x^2 + e^{x} \\ -3 \end{bmatrix}$
Vector-to-vector	$f: \mathbb{R}^d \to \mathbb{R}^n$	$\underbrace{f(\vec x) = \begin{bmatrix} 3 & 4 \\ 2 & 0 \\ 0 & 1 \end{bmatrix} \vec x}_{\text{linear transformation}}$ $f(\vec x) = \begin{bmatrix} x_1^2 + x_1x_2 + \cos(x_1^4) \\ 3x_1x_2 + 4 \\ 5x_1 \\ x_2 / x_3 \end{bmatrix}$

The first two types of functions are “scalar-valued”, while the latter two are “vector-valued”. These are not the only types of functions that exist; for instance, the function $f(A) = \text{rank}(A)$ is a matrix-to-scalar function.

The type of function we’re most concerned with at the moment is the vector-to-scalar function, i.e. a function that takes in a vector (or equivalently, multiple scalar inputs) and outputs a single scalar.

R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2

is one such function, and it’s the focus of this section.

Rates of Change

Let’s think from the perspective of rates of change, since ultimately what we’re building towards is a technique for minimizing functions. We’re most familiar with the concept of rates of change for scalar-to-scalar functions. For example, if

f(x) = x^2 \sin(x)

then its derivative,

\frac{\text{d}f}{\text{d}x} = 2x \sin(x) + x^2 \cos(x)

itself is a scalar-to-scalar function, which describes how quickly $f$ is changing at any point $x$ in the domain of $f$ . At $x = 3$ , for instance, the instantaneous rate of change is

\frac{\text{d}f}{\text{d}x}(3) = 2\cdot 3 \sin(3) + 3^2 \cos(3) \approx -8.06

meaning that at $x = 3$ , $f$ is decreasing at a rate of (approximately) 8.06 per unit change in $x$ . Perhaps a more intuitive way of thinking about the instantaneous rate of change is to think of it as the slope of the tangent line to $f$ at $x = 3$ .

import numpy as np
from new_grad_utils import plot_function_with_tangent_line

f = lambda x: (x ** 2) * np.sin(x)
f_prime = lambda x: 2 * x * np.sin(x) + x ** 2 * np.cos(x)
x_range = (-6, 6)
y_range = (-6, 6)

fig = plot_function_with_tangent_line(f, f_prime, x_range, y_range, initial_x_point=3, dtick=0.5)
fig.update_layout(width=800, height=600)
fig.show(renderer='notebook');

The steeper the slope, the faster $f$ is changing at that point; the sign of the slope tells us whether $f$ is increasing or decreasing at that point.
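If you’d like to double-check the value of the derivative at $x = 3$ without any plotting helpers, one option is to compare the formula against a difference quotient. This is only a sketch: the helper name `central_difference` and the step size `h` are my own choices, not something defined elsewhere in the notes.

```python
import numpy as np

def f(x):
    return x ** 2 * np.sin(x)

def f_prime(x):
    # Derivative from the product rule, as computed above.
    return 2 * x * np.sin(x) + x ** 2 * np.cos(x)

def central_difference(func, x, h=1e-5):
    # Slope of the secant line through (x - h, func(x - h)) and (x + h, func(x + h)).
    # h is a hypothetical step size; smaller is not always better numerically.
    return (func(x + h) - func(x - h)) / (2 * h)

exact = f_prime(3.0)
approx = central_difference(f, 3.0)

print(round(exact, 2))              # -8.06
print(abs(exact - approx) < 1e-6)   # True
```

The secant slope over a tiny interval around $x = 3$ agrees with the formula, which is exactly what “slope of the tangent line” means in the limit.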

In Chapter 2.3, we saw how to compute derivatives of functions that take in multiple scalar inputs, like

f(x, y, z) = x^2 + 2xy + 3xz + 4(y - z)^2

In the language of Chapter 8.1, we’d call such a function a vector-to-scalar function, and might use the notation

f(\vec x) = x_1^2 + 2x_1x_2 + 3x_1x_3 + 4(x_2 - x_3)^2

This function has three partial derivatives, each of which describes the instantaneous rate of change of $f$ with respect to one of its inputs, while holding the other two inputs constant. There’s a good animation of what it means to hold an input constant in Chapter 2.2 that is worth revisiting.

Here,

\frac{\partial f}{\partial x_1} = 2x_1 + 2x_2 + 3x_3, \quad \frac{\partial f}{\partial x_2} = 2x_1 + 8x_2 - 8x_3, \quad \frac{\partial f}{\partial x_3} = 3x_1 - 8x_2 + 8x_3

The big idea of this section, the gradient vector, packages all of these partial derivatives into a single vector. This will allow us to think about the direction in which $f$ is changing, rather than just looking at its rates of change in each dimension independently.
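The three partial derivatives above can be checked numerically by nudging one input at a time while holding the other two fixed. The sketch below does that; the test point and the helper name `numerical_partials` are arbitrary choices of mine.

```python
import numpy as np

def f(x):
    x1, x2, x3 = x
    return x1 ** 2 + 2 * x1 * x2 + 3 * x1 * x3 + 4 * (x2 - x3) ** 2

def partials(x):
    # The three partial derivatives derived above, stacked into one array.
    x1, x2, x3 = x
    return np.array([
        2 * x1 + 2 * x2 + 3 * x3,
        2 * x1 + 8 * x2 - 8 * x3,
        3 * x1 - 8 * x2 + 8 * x3,
    ])

def numerical_partials(func, x, h=1e-6):
    # Central difference in each coordinate, holding the others constant.
    x = np.asarray(x, dtype=float)
    result = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        result[i] = (func(x + step) - func(x - step)) / (2 * h)
    return result

point = np.array([1.0, 2.0, -1.0])   # arbitrary test point
analytic = partials(point)           # components are 3, 26, -21
numeric = numerical_partials(f, point)

print(analytic)
print(np.allclose(analytic, numeric, atol=1e-4))
```

Stacking the partial derivatives into a single array, as `partials` does, is already a preview of the gradient vector.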

The Gradient Vector

Definition: Gradient Vector

Suppose $f: \mathbb{R}^d \to \mathbb{R}$ is a vector-to-scalar function. The gradient vector of $f$ , denoted $\nabla f(\vec x)$ , is the vector in $\mathbb{R}^d$ of partial derivatives of $f$ :

\nabla f(\vec x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_d} \end{bmatrix}

$\nabla f(\vec x)$ itself is a vector-to-vector function; it takes in a vector $\vec x \in \mathbb{R}^d$ and outputs a new vector in $\mathbb{R}^d$ , describing the rates of change of $f$ along each dimension. The gradient, when evaluated at a point $\vec x_0$ , describes the direction of steepest ascent of $f$ at $\vec x_0$ , i.e. the direction in which $f$ is increasing most quickly.

Let’s start with a straightforward example where the partial derivatives are easy to compute. Let

f(\vec x) = x_1^2 + x_2^2 - 3x_1x_2

Then

\frac{\partial f}{\partial x_1} = 2x_1 - 3x_2, \quad \frac{\partial f}{\partial x_2} = 2x_2 - 3x_1

\nabla f(\vec x) = \begin{bmatrix} 2x_1 - 3x_2 \\ 2x_2 - 3x_1 \end{bmatrix}

If we evaluate the gradient at $\vec x = \begin{bmatrix} 2 \\ 1 \end{bmatrix}$ , we get

\nabla f\left(\begin{bmatrix} 2 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} 2(2) - 3(1) \\ 2(1) - 3(2) \end{bmatrix} = \begin{bmatrix} 1 \\ -4 \end{bmatrix}

What does the fact that $\nabla f\left(\begin{bmatrix} 2 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} 1 \\ -4 \end{bmatrix}$ tell us? It tells us that the direction of steepest ascent of $f$ at $\begin{bmatrix} 2 \\ 1 \end{bmatrix}$ is $\begin{bmatrix} 1 \\ -4 \end{bmatrix}$ . To put this into context, let’s consider another example.
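Before moving on, here is one rough way to test the steepest-ascent claim numerically. The sketch tries many unit-length directions around $\begin{bmatrix} 2 \\ 1 \end{bmatrix}$ , measures how quickly $f$ changes after a tiny step in each, and compares the best one to the gradient’s direction. The number of directions and the step size `h` are arbitrary choices of mine.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + x[1] ** 2 - 3 * x[0] * x[1]

def grad_f(x):
    # Gradient derived above.
    return np.array([2 * x[0] - 3 * x[1], 2 * x[1] - 3 * x[0]])

x0 = np.array([2.0, 1.0])
gradient = grad_f(x0)   # components are 1 and -4

# 3600 unit vectors, spaced 0.1 degrees apart around the unit circle.
angles = np.linspace(0, 2 * np.pi, 3600, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# Approximate rate of change of f after a tiny step h in each direction.
h = 1e-6
rates = np.array([(f(x0 + h * u) - f(x0)) / h for u in directions])

best_direction = directions[np.argmax(rates)]
unit_gradient = gradient / np.linalg.norm(gradient)

print(best_direction)
print(unit_gradient)
```

The two printed vectors should agree to within the 0.1 degree resolution of the search, which is consistent with the gradient pointing in the direction of steepest ascent.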

Visualizing the Gradient Vector

We’ll use this next example to understand what the gradient of a function tells us visually.

Suppose $\vec x \in \mathbb{R}^2$ , and let

f(\vec x) = x_1 e^{-x_1^2 - x_2^2}

import numpy as np
import plotly.graph_objects as go


x1_range = np.linspace(-2, 2, 100)
x2_range = np.linspace(-2, 2, 100)
x1_grid, x2_grid = np.meshgrid(x1_range, x2_range)
z_values = x1_grid * np.e ** (-x1_grid ** 2 - x2_grid ** 2)

fig = go.Figure(
    data=[
        go.Surface(
            x=x1_grid,
            y=x2_grid,
            z=z_values,
            colorscale='RdBu_r',
            showscale=False,
        )
    ]
)

fig.update_layout(
    scene=dict(
        xaxis=dict(
            title='x₁', gridcolor='#f0f0f0', showbackground=True,
            showline=True, linecolor='black', linewidth=1,
            tickfont=dict(family='Palatino', size=10), backgroundcolor='white'
        ),
        yaxis=dict(
            title='x₂', gridcolor='#f0f0f0', showbackground=True,
            showline=True, linecolor='black', linewidth=1,
            tickfont=dict(family='Palatino', size=10), backgroundcolor='white'
        ),
        zaxis=dict(
            title='f(x₁, x₂)', gridcolor='#f0f0f0', showbackground=True,
            showline=True, linecolor='black', linewidth=1,
            tickfont=dict(family='Palatino', size=10), backgroundcolor='white'
        ),
        aspectratio=dict(x=1, y=1, z=1),
        camera=dict(eye=dict(x=1.5, y=1.5, z=1))
    ),
    width=800,
    height=700,
    margin=dict(l=65, r=50, b=65, t=90),
    font=dict(family='Palatino', size=16, color='#222'),
    paper_bgcolor='white',
    plot_bgcolor='white',
    showlegend=False,
)

fig

To find $\nabla f(\vec x)$ , we need to compute the partial derivatives of $f$ with respect to each component of $\vec x$ . The “input variables” to $f$ are $x_1$ and $x_2$ , so we need to compute $\frac{\partial f}{\partial x_1}$ and $\frac{\partial f}{\partial x_2}$ . If it makes the algebra cleaner, feel free to rename $x_1$ and $x_2$ to $x$ and $y$ while differentiating, and then switch back to $x_1$ and $x_2$ at the end.

f(\vec x) = x_1 e^{-x_1^2 - x_2^2}

\frac{\partial f}{\partial x_1} = \frac{\partial}{\partial x_1} \left( x_1 e^{-x_1^2 - x_2^2} \right) = \underbrace{1 \cdot e^{-x_1^2 - x_2^2} + x_1 \cdot e^{-x_1^2 - x_2^2} \cdot (-2x_1)}_{\text{product rule}} = (1 - 2x_1^2) e^{-x_1^2 - x_2^2}

\frac{\partial f}{\partial x_2} = \frac{\partial}{\partial x_2} \left( x_1 e^{-x_1^2 - x_2^2} \right) = \underbrace{x_1 \cdot e^{-x_1^2 - x_2^2} \cdot (-2x_2)}_{\text{chain rule}} = -2x_1 x_2 e^{-x_1^2 - x_2^2}

Putting these together, we have

\nabla f(\vec x) = \begin{bmatrix} (1 - 2x_1^2) e^{-x_1^2 - x_2^2} \\ -2x_1 x_2 e^{-x_1^2 - x_2^2} \end{bmatrix}

Remember, $\nabla f(\vec x)$ itself is a function. If we plug in a value of $\vec x$ , we get a new vector back.

\nabla f\left(\begin{bmatrix} -1 \\ 0\end{bmatrix}\right) = \begin{bmatrix} (1 - 2(-1)^2) e^{-(-1)^2 - 0^2} \\ -2(-1)(0) e^{-(-1)^2 - 0^2} \end{bmatrix} = \begin{bmatrix} -1/e \\ 0 \end{bmatrix}
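The same evaluation can be done in code. This is a minimal sketch in which `f` and `grad_f` are my own names for the function and the gradient formula derived above; the finite-difference comparison and its step size `h` are just a sanity check I added.

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1 * np.exp(-x1 ** 2 - x2 ** 2)

def grad_f(x):
    # Gradient derived above; both components share the same exponential factor.
    x1, x2 = x
    common = np.exp(-x1 ** 2 - x2 ** 2)
    return np.array([(1 - 2 * x1 ** 2) * common, -2 * x1 * x2 * common])

point = np.array([-1.0, 0.0])
value = grad_f(point)   # should match [-1/e, 0]

# Independent check using central differences in each coordinate.
h = 1e-6
numeric = np.array([
    (f(point + np.array([h, 0.0])) - f(point - np.array([h, 0.0]))) / (2 * h),
    (f(point + np.array([0.0, h])) - f(point - np.array([0.0, h]))) / (2 * h),
])

print(value)
print(np.allclose(value, [-1 / np.e, 0.0]), np.allclose(value, numeric, atol=1e-6))
```

Passing a different `point` to `grad_f` returns a different vector, which is what it means for the gradient to be a function in its own right.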

What does $\nabla f\left(\begin{bmatrix} -1 \\ 0\end{bmatrix}\right) = \begin{bmatrix} -1/e \\ 0 \end{bmatrix}$ really tell us? In order to visualize it, let me introduce another way of visualizing $f$ , called a contour plot.

x1_range = np.linspace(-2, 2, 100)
x2_range = np.linspace(-2, 2, 100)
x1_grid, x2_grid = np.meshgrid(x1_range, x2_range)
z_values = x1_grid * np.e ** (-x1_grid ** 2 - x2_grid ** 2)

fig = go.Figure(
    data=[
        go.Contour(
            z=z_values,
            x=x1_range,
            y=x2_range,
            colorscale='RdBu_r',
            showscale=False,
            contours=dict(
                showlabels=True,
                labelfont=dict(size=12, color='white', family='Palatino'),
            ),
        )
    ]
)

fig.update_layout(
    title=r'$$\text{Contour plot of } f(\vec x) = x_1 e^{-x_1^2 - x_2^2}$$',
    width=700,
    height=700,
    margin=dict(l=65, r=50, b=65, t=90),
    font=dict(family='Palatino', size=16, color='#222'),
    paper_bgcolor='white',
    plot_bgcolor='white',
    showlegend=False,
    xaxis=dict(
        title='x₁', showline=True, linecolor='black', linewidth=1,
        showgrid=True, gridcolor='#f0f0f0', tickfont=dict(family='Palatino', size=10)
    ),
    yaxis=dict(
        title='x₂', showline=True, linecolor='black', linewidth=1,
        showgrid=True, gridcolor='#f0f0f0', tickfont=dict(family='Palatino', size=10)
    ),
)

fig

I think of the contour plot as a bird’s-eye view of $f$ when you look at the surface from above. Notice the correspondence between the colors in both graphs.

The circle-like traces in the contour plot are called level curves; they represent slices through the surface at a constant height. On the right, the circle labeled 0.1 represents the set of points where $f(x_1, x_2) = 0.1$ .

Visualizing the fact that $\nabla f\left(\begin{bmatrix} -1 \\ 0\end{bmatrix}\right) = \begin{bmatrix} -1/e \\ 0 \end{bmatrix}$ is easier to do in the contour plot, since the contour plot is 2-dimensional, like the gradient vector is. Remember that red values are high and blue values are low.

import numpy as np
import plotly.graph_objects as go
import plotly.figure_factory as ff

def make_contour_figure(f, lim=2, xaxis_title='x₁', yaxis_title='x₂', title='', contour_kwargs=None, contour_opacity=1.0):
    pad = lim * 1.3
    x1_range = np.linspace(-pad, pad, 150)
    x2_range = np.linspace(-pad, pad, 150)
    x1_grid, x2_grid = np.meshgrid(x1_range, x2_range)
    z_values = f(x1_grid, x2_grid)

    if contour_kwargs is None:
        contour_kwargs = {}

    base_contour = dict(
        z=z_values,
        x=x1_range,
        y=x2_range,
        colorscale='RdBu_r',
        showscale=False,
        opacity=contour_opacity,
        contours=dict(
            showlabels=True,
            labelfont=dict(size=12, color='white', family='Palatino'),
        ),
    )
    base_contour.update(contour_kwargs)

    fig = go.Figure(data=[go.Contour(**base_contour)])
    fig.update_layout(
        title=title,
        width=700,
        height=700,
        margin=dict(l=65, r=50, b=65, t=90),
        font=dict(family='Palatino', size=16, color='#222'),
        paper_bgcolor='white',
        plot_bgcolor='white',
        showlegend=False,
        xaxis=dict(
            title=xaxis_title, showline=True, linecolor='black', linewidth=1,
            showgrid=True, gridcolor='#f0f0f0', tickfont=dict(family='Palatino', size=10)
        ),
        yaxis=dict(
            title=yaxis_title, showline=True, linecolor='black', linewidth=1,
            showgrid=True, gridcolor='#f0f0f0', tickfont=dict(family='Palatino', size=10)
        ),
    )
    return fig

def plot_gradient_on_contour(f, dfx1, dfx2, point, lim=2, **kwargs):
    fig = make_contour_figure(
        f=f,
        lim=lim,
        xaxis_title=kwargs.get('xaxis_title', 'x₁'),
        yaxis_title=kwargs.get('yaxis_title', 'x₂'),
        title=kwargs.get('title', f'<span style="color:gold"><b>Gradient Vector</b></span> at Point ({point[0]}, {point[1]})'),
        contour_kwargs=kwargs.get('contour_kwargs'),
    )

    x0, y0 = point
    dx = dfx1(x0, y0)
    dy = dfx2(x0, y0)
    x_end = x0 + dx
    y_end = y0 + dy
    arrow_color = kwargs.get('arrow_color', 'gold')

    fig.add_trace(go.Scatter(
        x=[x0, x_end],
        y=[y0, y_end],
        mode='lines',
        line=dict(color=arrow_color, width=5),
        showlegend=False,
    ))
    fig.add_trace(go.Scatter(
        x=[x0],
        y=[y0],
        mode='markers',
        marker=dict(size=10, color=arrow_color),
        showlegend=False,
    ))
    fig.add_annotation(
        x=x_end,
        y=y_end,
        ax=x0,
        ay=y0,
        xref='x',
        yref='y',
        axref='x',
        ayref='y',
        showarrow=True,
        arrowhead=5,
        arrowsize=1,
        arrowwidth=3,
        arrowcolor=arrow_color,
    )
    return fig

def plot_gradient_field_on_contour(
    f,
    dfx1,
    dfx2,
    lim=2,
    grid_size=13,
    scale=0.12,
    arrow_scale=0.18,
    normalize=True,
    zero_points=None,
    **kwargs
):
    fig = make_contour_figure(
        f=f,
        lim=lim,
        xaxis_title=kwargs.get('xaxis_title', 'x₁'),
        yaxis_title=kwargs.get('yaxis_title', 'x₂'),
        title=kwargs.get('title', 'Gradient Field'),
        contour_kwargs=kwargs.get('contour_kwargs'),
        contour_opacity=kwargs.get('contour_opacity', 0.7),
    )

    margin = scale * (1 + arrow_scale) * 1.1
    x1 = np.linspace(-lim + margin, lim - margin, grid_size)
    x2 = np.linspace(-lim + margin, lim - margin, grid_size)
    x1_grid, x2_grid = np.meshgrid(x1, x2)
    grad_x = dfx1(x1_grid, x2_grid)
    grad_y = dfx2(x1_grid, x2_grid)

    if normalize:
        grad_mag = np.sqrt(grad_x ** 2 + grad_y ** 2)
        grad_x = np.divide(grad_x, grad_mag, out=np.zeros_like(grad_x), where=grad_mag > 0)
        grad_y = np.divide(grad_y, grad_mag, out=np.zeros_like(grad_y), where=grad_mag > 0)

    quiver = ff.create_quiver(
        x1_grid.ravel(),
        x2_grid.ravel(),
        grad_x.ravel(),
        grad_y.ravel(),
        scale=scale,
        arrow_scale=arrow_scale,
        line=dict(color=kwargs.get('arrow_color', 'black'), width=2),
    )

    for trace in quiver.data:
        trace.showlegend = False
        fig.add_trace(trace)

    if zero_points is not None and len(zero_points) > 0:
        zero_x = [point[0] for point in zero_points]
        zero_y = [point[1] for point in zero_points]
        fig.add_trace(go.Scatter(
            x=zero_x,
            y=zero_y,
            mode='markers',
            marker=dict(size=9, color=kwargs.get('zero_point_color', kwargs.get('arrow_color', 'black'))),
            showlegend=False,
        ))

    fig.update_xaxes(range=[-lim, lim], autorange=False)
    fig.update_yaxes(range=[-lim, lim], scaleanchor='x', scaleratio=1, autorange=False)

    return fig

fig = plot_gradient_on_contour(
    f=lambda x, y: x * np.e ** (-x ** 2 - y ** 2),
    dfx1=lambda x, y: (1 - 2 * x ** 2) * np.e ** (-x ** 2 - y ** 2),
    dfx2=lambda x, y: -2 * x * y * np.e ** (-x ** 2 - y ** 2),
    point=(-1, 0),
    xaxis_title='x₁',
    yaxis_title='x₂',
)

fig.update_layout(
    width=700,
    height=700,
)

fig

At the point $\begin{bmatrix} -1 \\ 0 \end{bmatrix}$ , which is at the tail of the vector drawn in gold, $f$ is near its global minimum, meaning there are lots of directions in which we can move to increase $f$ . Among all of them, the gradient vector at this point, $\begin{bmatrix} -1/e \\ 0 \end{bmatrix}$ , points in the direction of steepest ascent starting at $\begin{bmatrix} -1 \\ 0 \end{bmatrix}$ . The gradient describes the “quickest way up”.

As another example, consider the fact that $\nabla f\left(\begin{bmatrix} 1.25 \\ -0.5 \end{bmatrix}\right) \approx \begin{bmatrix} -0.347 \\ 0.204 \end{bmatrix}$ .

fig = plot_gradient_on_contour(
    f=lambda x, y: x * np.e ** (-x ** 2 - y ** 2),
    dfx1=lambda x, y: (1 - 2 * x ** 2) * np.e ** (-x ** 2 - y ** 2),
    dfx2=lambda x, y: -2 * x * y * np.e ** (-x ** 2 - y ** 2),
    point=(1.25, -0.5),
    xaxis_title='x₁',
    yaxis_title='x₂',
)

fig.update_layout(
    width=700,
    height=700
)

Again, the gradient at $\begin{bmatrix} 1.25 \\ -0.5 \end{bmatrix}$ gives us the direction in which $f$ is increasing the quickest at that very point. If we move even a little bit in any direction (in the direction of the gradient or some other direction), the gradient will change.
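To see the gradient change as we move, here is a small sketch that evaluates it at $\begin{bmatrix} 1.25 \\ -0.5 \end{bmatrix}$ , takes a short step in the gradient’s direction, and evaluates it again. The step length of 0.1 is an arbitrary choice of mine.

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1 * np.exp(-x1 ** 2 - x2 ** 2)

def grad_f(x):
    x1, x2 = x
    common = np.exp(-x1 ** 2 - x2 ** 2)
    return np.array([(1 - 2 * x1 ** 2) * common, -2 * x1 * x2 * common])

point = np.array([1.25, -0.5])
g = grad_f(point)   # approximately [-0.347, 0.204]

# Move a short distance in the direction of the gradient.
step = 0.1          # hypothetical step length
moved = point + step * g / np.linalg.norm(g)
g_moved = grad_f(moved)

print(g, g_moved)               # the two gradients differ
print(f(point), f(moved))       # f is larger after the move
```

The function value goes up after the step, as expected, and the gradient at the new point is a different vector, so the “quickest way up” has to be recomputed there.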

One way to see this more globally is to draw many gradient vectors at once, forming a gradient field.

fig = plot_gradient_field_on_contour(
    f=lambda x, y: x * np.e ** (-x ** 2 - y ** 2),
    dfx1=lambda x, y: (1 - 2 * x ** 2) * np.e ** (-x ** 2 - y ** 2),
    dfx2=lambda x, y: -2 * x * y * np.e ** (-x ** 2 - y ** 2),
    lim=2,
    grid_size=13,
    scale=0.3,
    arrow_scale=0.3,
    normalize=False,
    zero_points=[(-1 / np.sqrt(2), 0), (1 / np.sqrt(2), 0)],
    xaxis_title='x₁',
    yaxis_title='x₂',
    title=r'$$\text{Gradient field of } f(\vec x) = x_1 e^{-x_1^2 - x_2^2}$$',
    arrow_color='black',
    zero_point_color='black',
)

fig.update_xaxes(range=[-1.9, 1.9], autorange=False)
fig.update_yaxes(range=[-1.9, 1.9], scaleanchor='x', scaleratio=1, autorange=False)

fig.update_layout(
    width=620,
    height=620,
    margin=dict(l=40, r=20, b=45, t=75),
)

fig

Each arrow in this gradient field shows the gradient vector at a different point, and the arrow lengths are proportional to the magnitude of the gradient there. Longer arrows indicate places where $f$ increases more steeply, while shorter arrows indicate places where the rate of increase is smaller. Note that the arrows don’t necessarily all point to the “top” of the function, located at $\begin{bmatrix} \frac{\sqrt{2}}{2} \\ 0 \end{bmatrix} \approx \begin{bmatrix} 0.707 \\ 0 \end{bmatrix}$ – instead, they point in the direction of steepest ascent at each point.

At the two critical points $\begin{bmatrix} -\frac{\sqrt{2}}{2} \\ 0 \end{bmatrix}$ and $\begin{bmatrix} \frac{\sqrt{2}}{2} \\ 0 \end{bmatrix}$ , the gradient really is $\begin{bmatrix} 0 \\ 0 \end{bmatrix}$ , so those locations are shown as points instead of arrows.

As you might guess, to find the critical points of a function - that is, places where it is neither increasing nor decreasing - we need to find points where the gradient is zero. Hold that thought.
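As a quick check of the two critical points marked in the gradient field, the sketch below plugs them into the gradient formula derived earlier. The names `f` and `grad_f` are again my own.

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1 * np.exp(-x1 ** 2 - x2 ** 2)

def grad_f(x):
    x1, x2 = x
    common = np.exp(-x1 ** 2 - x2 ** 2)
    return np.array([(1 - 2 * x1 ** 2) * common, -2 * x1 * x2 * common])

low = np.array([-1 / np.sqrt(2), 0.0])
high = np.array([1 / np.sqrt(2), 0.0])

# Both gradients should be the zero vector, up to floating-point error.
print(grad_f(low), grad_f(high))

# The function values at the two points have opposite signs.
print(f(low), f(high))
```

The first component vanishes because $1 - 2x_1^2 = 0$ when $x_1 = \pm \frac{\sqrt{2}}{2}$ , and the second vanishes because $x_2 = 0$ .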

In our course, most of the functions we’ll work with won’t be defined in terms of the individual components of the input vector $\vec x$ , like in the case of $f(\vec x) = x_1^2 + x_2^2 - 3x_1x_2$ . Instead, they’ll be defined in terms of matrix-vector operations, like $f(\vec x) = \vec x^T A \vec x$ . Chapter 8.2 explores how to compute gradients of functions like this.
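To connect the two styles of definition, note that the earlier example $f(\vec x) = x_1^2 + x_2^2 - 3x_1x_2$ can itself be written in the form $\vec x^T A \vec x$ . The sketch below uses one symmetric choice of $A$ that I picked for illustration (it is not the only matrix that works) and checks that both forms agree at a few points.

```python
import numpy as np

def f_components(x):
    # Defined in terms of the individual components of x.
    return x[0] ** 2 + x[1] ** 2 - 3 * x[0] * x[1]

# A symmetric matrix chosen so that x^T A x expands to the expression above:
# the diagonal gives x1^2 + x2^2, and the two off-diagonal entries sum to -3.
A = np.array([[1.0, -1.5],
              [-1.5, 1.0]])

def f_matrix(x):
    # Defined in terms of matrix-vector operations.
    return x @ A @ x

test_points = [np.array([2.0, 1.0]), np.array([-1.0, 3.0]), np.array([0.5, 0.5])]
for x in test_points:
    print(f_components(x), f_matrix(x))
```

At $\begin{bmatrix} 2 \\ 1 \end{bmatrix}$ , for instance, both forms give $4 + 1 - 6 = -1$ .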