6.3. Projecting onto the Column Space

Approximating using a Single Vector¶

In Chapter 3.4, we introduced the approximation problem, which asked:

Among all vectors of the form $k \color{#3d81f6} \vec x$ , which one is closest to ${\color{orange}\vec y}$ ?

We now know the answer is the vector $\color{#004d40} \vec p$ , where

{\color{#004d40} \vec p} = \left( \frac{{\color{orange}\vec y} \cdot \color{#3d81f6} \vec x}{{\color{#3d81f6}\vec x} \cdot {\color{#3d81f6}\vec x}} \right) \color{#3d81f6} \vec x

$\color{#004d40} \vec p$ is called the orthogonal projection of $\color{orange} \vec y$ onto $\color{#3d81f6} \vec x$ .

Note that I’ve used $\color{orange} \vec y$ and $\color{#3d81f6} \vec x$ here rather than $\color{orange} \vec u$ and $\color{#3d81f6} \vec v$ , just to make the notation more consistent with the notation we’ll use as we move back into the world of machine learning.

# This chunk must be in the first plotting cell of each notebook in order to guarantee that the mathjax script is loaded.

import plotly
from IPython.display import display, HTML

plotly.offline.init_notebook_mode()
display(HTML(
    '<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_SVG"></script>'
))

import numpy as np
import plotly.graph_objects as go

u = np.array([3, 1])
v = np.array([3, -2])

# Calculate k^* where error is orthogonal to v
k_star = np.dot(u, v) / np.dot(v, v)
p = k_star * v  # p = k^* v
e = u - p       # error vector (orthogonal to v)

def create_vector_trace(coordinates, color, label, opacity=1.0):
    x, y = coordinates
    return go.Scatter(
        x=[0, x], 
        y=[0, y],
        mode='lines+markers',
        line=dict(color=color, width=4),
        marker=dict(
            size=[0, 16],
            color=[color, color],
            symbol=['circle', 'arrow'],
            angleref='previous'
        ),
        hovertemplate='(%{x}, %{y})<extra></extra>',
        showlegend=False,
        name=label,
        opacity=opacity
    )

def create_error_trace(start_coords, end_coords, color, label, opacity=1.0):
    return go.Scatter(
        x=[start_coords[0], end_coords[0]], 
        y=[start_coords[1], end_coords[1]],
        mode='lines+markers',
        line=dict(color=color, width=3, dash='dot'),
        marker=dict(
            size=[0, 12],
            color=[color, color],
            symbol=['circle', 'arrow'],
            angleref='previous'
        ),
        hovertemplate='(%{x}, %{y})<extra></extra>',
        showlegend=False,
        name=label,
        opacity=opacity
    )

def plot_static_projection():
    traces = []

    traces.append(create_vector_trace(tuple(u), 'orange', r'$\vec y$'))
    traces.append(create_vector_trace(tuple(v), '#3d81f6', r'$\vec x$', opacity=0.5))
    traces.append(create_vector_trace(tuple(p), '#004d40', r'$\vec p = k^* \vec x$'))
    traces.append(create_error_trace(tuple(p), tuple(u), '#d81b60', r'$\vec e$'))

    v_unit = v / np.linalg.norm(v)
    e_unit = e / np.linalg.norm(e)
    marker_size = 0.12

    p0 = p
    p1 = p0 + -v_unit * marker_size
    p2 = p1 + e_unit * marker_size

    right_angle_trace = go.Scatter(
        x=[p1[0], p2[0], p2[0] - (p1[0] - p0[0])],
        y=[p1[1], p2[1], p2[1] - (p1[1] - p0[1])],
        mode='lines',
        line=dict(color="#222", width=2),
        showlegend=False,
        hoverinfo='skip'
    )
    traces.append(right_angle_trace)
    
    min_x = min(0, p[0], u[0], v[0]) - 0.5
    max_x = max(0, p[0], u[0], v[0]) + 1.5
    min_y = min(0, p[1], u[1], v[1]) - 0.5
    max_y = max(0, p[1], u[1], v[1]) + 1.5

    top_right_corner = (max_x - 0.2, max_y - 0.2)

    fig = go.Figure(data=traces)
    
    fig.add_annotation(
        x=u[0],
        y=u[1] + 0.13,
        text=r"$\vec y$",
        showarrow=False,
        font=dict(size=18, family="Palatino, serif", color="orange"),
        align="center"
    )
    fig.add_annotation(
        x=v[0] + 0.13,
        y=v[1] - 0.13,
        text=r"$\vec x$",
        showarrow=False,
        font=dict(size=18, family="Palatino, serif", color="#3d81f6"),
        align="center"
    )
    fig.add_annotation(
        x=p[0] - 1,
        y=p[1] + 0.2,
        text=fr"$\vec p = k^* \vec x$",
        showarrow=False,
        font=dict(size=18, family="Palatino, serif", color="#004d40"),
        align="left"
    )
    fig.add_annotation(
        x=p[0] - 1,
        y=p[1] + 0.2,
        text=fr"$\vec p = k^* \vec x$",
        showarrow=False,
        font=dict(size=18, family="Palatino, serif", color="#004d40"),
        align="left"
    )
    fig.add_annotation(
        x=(p[0] + u[0]) / 2 + 0.21,
        y=(p[1] + u[1]) / 2 + 0.03,
        text=r"$\vec e$",
        showarrow=False,
        font=dict(size=18, family="Palatino, serif", color="#d81b60"),
        align="center"
    )
    fig.update_layout(
        width=480,
        height=420,
        yaxis_scaleanchor="x",
        margin=dict(l=10, r=10, t=10, b=10),
        font=dict(family="Palatino, serif"),
        plot_bgcolor="white",
        paper_bgcolor="white",
    )
    fig.update_xaxes(
        range=[min_x, max_x],
        showticklabels=False,
        gridcolor="#fff",
        zerolinecolor="#fff"
    )
    fig.update_yaxes(
        range=[min_y, max_y],
        showticklabels=False,
        gridcolor="#fff",
        zerolinecolor="#fff"
    )
    return fig

plot_static_projection().show(renderer='png', scale=3)

Loading...

As we’ve studied, the resulting error vector,

{\color{#d81a60} \vec e} = {\color{orange} \vec y} - {\color{#004d40} \vec p}

is orthogonal to $\color{#3d81f6} \vec x$ .

In our original look at the approximation problem, we were approximating $\color{orange} \vec y$ using a scalar multiple of just a single vector, $\color{#3d81f6} \vec x$ . The set of all scalar multiples of $\color{#3d81f6} \vec x$ , denoted by $\text{span}(\{ {\color{#3d81f6} \vec x}\})$ , is a line in $\mathbb{R}^n$ .

Key idea: instead of projecting onto the subspace spanned by just a single vector, how might we project onto the subspace spanned by multiple vectors?

Approximating using Multiple Vectors¶

Equipped with our understanding of linear independence, spans, subspaces, and column spaces, we’re ready to tackle a more advanced version of the approximation problem.

The Approximation Problem

Suppose ${\color{#3d81f6} \vec x^{(1)}}, {\color{#3d81f6} \vec x^{(2)}}, ..., {\color{#3d81f6} \vec x^{(d)}} \in \mathbb{R}^n$ , and $\color{orange} \vec y \in \mathbb{R}^n$ is not necessarily in $\text{span}(\{{\color{#3d81f6} \vec x^{(1)}}, {\color{#3d81f6} \vec x^{(2)}}, ..., {\color{#3d81f6} \vec x^{(d)}}\})$ .

We can construct the matrix $\color{#3d81f6} X$ by placing the ${\color{#3d81f6} \vec x^{(i)}}$ 's in its columns:

{\color{#3d81f6} X} = \color{#3d81f6}\begin{bmatrix} {\color{#3d81f6} |} & {\color{#3d81f6} |} & & {\color{#3d81f6} |} \\ {\color{#3d81f6} \vec{x}^{(1)}} & {\color{#3d81f6} \vec{x}^{(2)}} & \cdots & {\color{#3d81f6} \vec{x}^{(d)}} \\ {\color{#3d81f6} |} & {\color{#3d81f6} |} & & {\color{#3d81f6} |} \end{bmatrix}

Then, the following three statements are all equivalent ways of asking the approximation problem:

Among all vectors in $\text{span}(\{{\color{#3d81f6} \vec x^{(1)}}, {\color{#3d81f6} \vec x^{(2)}}, ..., {\color{#3d81f6} \vec x^{(d)}}\})$ , which is closest to $\color{orange} \vec y$ ?
Among all vectors in $\text{colsp}({\color{#3d81f6} X})$ , which is closest to $\color{orange} \vec y$ ?
Among all vectors of the form ${\color{#3d81f6} X} \vec w$ , where $\vec w \in \mathbb{R}^d$ , which is closest to $\color{orange} \vec y$ ?

All three statements at the bottom of the box above are asking the exact same question; I’ve presented all three forms so that you see more clearly how the ideas of spans, column spaces, and matrix-vector multiplication fit together. I will tend to refer to the latter two versions of the problem the most. In what follows, suppose $\color{#3d81f6} X$ is an $n \times d$ matrix whose columns ${\color{#3d81f6} \vec x^{(1)}}$ , ${\color{#3d81f6} \vec x^{(2)}}$ , ..., ${\color{#3d81f6} \vec x^{(d)}}$ are the building blocks we want to approximate $\color{orange} \vec y$ with.

First, let’s get the trivial case out of the way. If ${\color{orange} \vec y} \in \text{colsp}({\color{#3d81f6} X})$ , then the vector in $\text{colsp}({\color{#3d81f6} X})$ that is closest to $\color{orange} \vec y$ is just $\color{orange} \vec y$ itself. If that’s the case, there exists some $\vec w$ such that ${\color{orange} \vec y} = {\color{#3d81f6} X} \vec w$ exactly. This $\vec w$ is unique only if ${\color{#3d81f6} X}$ 's columns are linearly independent; otherwise, there will be infinitely many good $\vec w$ ’s.

But, that’s not the case I’m really interested in. I care more about when $\color{orange} \vec y$ is not in $\text{colsp}({\color{#3d81f6} X})$ . (Remember, this is the case we’re interested in when we’re doing linear regression: usually, it’s not possible to make our predictions 100% correct, and we’ll have to settle for some error.)

Then what?

In general, $\text{colsp}({\color{#3d81f6} X})$ is an $r$ -dimensional subspace of $\mathbb{R}^n$ , where $r = \text{rank}({\color{#3d81f6} X})$ . In the diagram below, I’ve used a plane to represent $\text{colsp}({\color{#3d81f6} X})$ ; just remember that ${\color{#3d81f6} X}$ may have more than 3 rows or columns.

from utils import plot_vectors
import numpy as np
import plotly.graph_objects as go

y = np.array([1, 3, 2]) * 1.5
v1 = (1, 0, 0.25)
v2 = np.array([2, 2, 0]) * 0.5

def create_base_fig(show_spanners=True):
    # Define the vectors
    v3 = (-3.5 * v1[0] + 4 * v2[0], -3.5 * v1[1] + 4 * v2[1], -3.5 * v1[2] + 4 * v2[2])  # v3 is v1 + v2, which is on the plane spanned by v1 and v2

    # Plot the vectors using plot_vectors function
    vectors = [
        (tuple(y), "orange", "y"),
        (v1, "#3d81f6", "x⁽¹⁾"), 
        (tuple(-v2), "#3d81f6", "x⁽²⁾"),
        (v3, "#3d81f6", "x⁽³⁾")
    ]
    
    if not show_spanners:
        vectors = [vectors[0]]

    fig = plot_vectors(vectors, show_axis_labels=True, vdeltaz=5, vdeltax=1)

    # Make the plane look more rectangular by using a smaller, symmetric range for s and t
    plane_extent = 20  # controls the "size" of the rectangle
    num_points = 3   # fewer points for a cleaner rectangle

    s_range = np.linspace(-plane_extent, plane_extent, num_points)
    t_range = np.linspace(-plane_extent, plane_extent, num_points)
    s_grid, t_grid = np.meshgrid(s_range, t_range)

    plane_x = s_grid * v1[0] + t_grid * v2[0]
    plane_y = s_grid * v1[1] + t_grid * v2[1]
    plane_z = s_grid * v1[2] + t_grid * v2[2]

    fig.add_trace(go.Surface(
        x=plane_x,
        y=plane_y,
        z=plane_z,
        opacity=0.8,
        colorscale=[[0, 'rgba(61,129,246,0.3)'], [1, 'rgba(61,129,246,0.3)']],
        showscale=False,
    ))

    # Annotate the plane with "colsp(X)"
    # Move the annotation "down" the plane by choosing negative s and t values
    label_s = 0.3
    label_t = 0.9
    label_x = label_s * v1[0] + label_t * v2[0]
    label_y_coord = label_s * v1[1] + label_t * v2[1] - 2
    label_z = label_s * v1[2] + label_t * v2[2]

    fig.add_trace(go.Scatter3d(
        x=[4],
        y=[-2],
        z=[2],  # small offset above the plane for visibility
        mode="text",
        text=[r"colsp(X)"],
        textposition="middle center",
        textfont=dict(size=22, color="#3d81f6"),
        showlegend=False
    ))

    # Set equal ranges for all axes to make grid boxes square
    axis_range = [-3, 5]  # Same range for all axes
    
    fig.update_layout(
        scene_camera=dict(
            eye=dict(x=1, y=-1.3, z=1)
        ),
        scene=dict(
            xaxis=dict(
                range=axis_range,
                dtick=1,
            ),
            yaxis=dict(
                range=axis_range,
                dtick=1,
            ),
            zaxis=dict(
                range=axis_range,
                dtick=1,
            ),
            aspectmode="cube",
            aspectratio=dict(x=1, y=1, z=1),  # Explicitly set 1:1:1 ratio
        ),
    )
    return fig

create_base_fig().show()

Loading...

Remember that $\text{colsp}({\color{#3d81f6} X})$ is the set of linear combinations of ${\color{#3d81f6} X}$ 's columns, so it’s the set of all vectors that can be written as ${\color{#3d81f6} X} \vec w$ , where $\vec w \in \mathbb{R}^d$ .

Let’s consider two possible vectors of the form ${\color{#3d81f6} X} \vec w$ , and look at their corresponding error vectors, ${\color{#d81a60} \vec e} = {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w$ . I won’t draw the columns of ${\color{#3d81f6} X}$ , since those would clutter up the picture.

def plot_vectors_with_errors(vectors_list, base_fig=None, notation="0"):
    """
    Takes a list of vectors and draws each one in #004d40 with a dotted error line in #d81a60.
    
    Parameters:
    -----------
    vectors_list : list of array-like
        List of vectors to plot. Each vector should be a 3D point/vector.
    base_fig : plotly.graph_objects.Figure, optional
        Base figure to build upon. If None, creates a new base figure using create_base_fig().
    
    Returns:
    --------
    fig : plotly.graph_objects.Figure
        The figure with vectors and error lines added.
    """
    fig = create_base_fig(show_spanners=False)
    
    # Get y vector from create_base_fig (defined the same way)
    y = np.array([1, 3, 2]) * 1.5
    
    for i, vec in enumerate(vectors_list):
        vec = np.array(vec)
        
        # Draw the vector in #004d40
        fig.add_trace(go.Scatter3d(
            x=[0, vec[0]],
            y=[0, vec[1]],
            z=[0, vec[2]],
            mode='lines',
            line=dict(color='#004d40', width=6),
            showlegend=False
        ))
        
        # Add arrowhead for the vector
        fig.add_trace(go.Cone(
            x=[vec[0]],
            y=[vec[1]],
            z=[vec[2]],
            u=[vec[0]],
            v=[vec[1]],
            w=[vec[2]],
            colorscale=[[0, '#004d40'], [1, '#004d40']],
            showscale=False,
            sizemode="absolute",
            sizeref=0.3,
            showlegend=False
        ))
        
        # Draw the error line (dotted) from vec to y in #d81a60
        fig.add_trace(go.Scatter3d(
            x=[vec[0], y[0]],
            y=[vec[1], y[1]],
            z=[vec[2], y[2]],
            mode='lines',
            line=dict(color='#d81a60', width=4, dash='dash'),
            showlegend=False
        ))

        # Annotate the vectors and error lines
        if i == 0:
            # Annotate the vector as Xw₀
            fig.add_trace(go.Scatter3d(
                x=[1], y=[3], z=[-0.5],
                mode='text',
                text=["p\u2080 = Xw\u2080"] if notation == "0" else ["p = Xw*"],  # \u2080 is Unicode subscript zero
                textposition="middle center",
                textfont=dict(color='#004d40', size=16),
                showlegend=False
            ))
            # Annotate the error as e₀
            mid_error = (vec + y) / 2
            fig.add_trace(go.Scatter3d(
                x=[mid_error[0]],
                y=[mid_error[1]],
                z=[mid_error[2]],
                mode='text',
                text=["e\u2080"] if notation == "0" else ["e"],
                textposition="bottom center",
                textfont=dict(color='#d81a60', size=16),
                showlegend=False
            ))
        else:
            # Annotate the vector as Xw'
            fig.add_trace(go.Scatter3d(
                x=[2], y=[-1], z=[1],
                mode='text',
                text=["p' = Xw'"],
                textposition="middle center",
                textfont=dict(color='#004d40', size=16),
                showlegend=False
            ))
            # Annotate the error as e'
            mid_error = (vec + y) / 2
            fig.add_trace(go.Scatter3d(
                x=[mid_error[0]],
                y=[mid_error[1]],
                z=[mid_error[2]],
                mode='text',
                text=["e'"],
                textposition="bottom center",
                textfont=dict(color='#d81a60', size=16),
                showlegend=False
            ))
    
    return fig

fig = create_base_fig()

# y = np.array([1, 3, 2]) * 1.5
# v1 = (1, 0, 0.25)
# v2 = np.array([2, 2, 0]) * 0.5

X = np.vstack([v1, v2]).T
w = np.linalg.inv(X.T @ X) @ X.T @ y
# w = [-1.333, 3.6667]
p = X @ w
other_p = X @ np.array([2, 1])

test_vectors = [p, other_p]

# Plot vectors with errors
fig = plot_vectors_with_errors(test_vectors, fig)
fig.show()

Loading...

Our problem boils down to finding the $\vec w$ that minimizes the norm of the error vector. Since it’s a bit easier to work with squared norms (remember that $\lVert \vec x \rVert^2 = \vec x \cdot \vec x$ ), we’ll minimize the squared norm of the error vector instead; this is an equivalent problem, since the norm is non-negative to begin with.

\underbrace{\lVert {\color{#d81a60} \vec e} \rVert^2 = \lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \rVert^2}_{\text{which $\vec w$ minimizes this?}}

Think of $\lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \rVert^2$ as a function of $\vec w$ only; $\color{#3d81f6} X$ and $\color{orange} \vec y$ should be thought of as fixed. This is a least squares problem: we’re looking for the $\vec w$ that minimizes the sum of squared errors between $\color{orange} \vec y$ and ${\color{#3d81f6} X} \vec w$ .

There are two ways we’ll minimize this function of $\vec w$ :

Using a geometric argument, as we did in the single vector case.
Using calculus. This is more involved than before, since the input variable is a vector, not a scalar, but it can be done, as we’ll see in Chapter 8.

Let’s focus on the geometric argument. What does our intuition tell us? Extending the single vector case, we expect the vector in $\text{colsp}({\color{#3d81f6} X})$ that is closest to $\color{orange} \vec y$ to be the orthogonal projection of $\color{orange} \vec y$ onto $\text{colsp}({\color{#3d81f6} X})$ : that is, its error should be orthogonal to $\text{colsp}({\color{#3d81f6} X})$ .

We could see this intuitively in the visual above. $\vec w_\text{o}$ was chosen to make $\color{#d81a60}\vec e_\text{o}$ orthogonal to $\text{colsp}({\color{#3d81f6} X})$ , meaning that $\color{#d81a60}\vec e_\text{o}$ is orthogonal to every vector in $\text{colsp}({\color{#3d81f6} X})$ . (The subscript “o” stands for “orthogonal”.) $\vec w'$ was some other arbitrary vector, leading $\color{#d81a60}\vec e'$ to not be orthogonal to $\text{colsp}({\color{#3d81f6} X})$ . Clearly $\color{#d81a60}\vec e_\text{o}$ is shorter than $\color{#d81a60}\vec e'$ .

To prove that the optimal choice of $\vec w$ comes from making the error vector orthogonal to $\text{colsp}({\color{#3d81f6} X})$ , you could use the same argument as in the single vector case: if you consider two vectors, $\vec w_\text{o}$ with an orthogonal error vector $\color{#d81a60}\vec e_\text{o}$ , and $\vec w'$ with an error vector $\color{#d81a60}\vec e'$ that is not orthogonal to $\text{colsp}({\color{#3d81f6} X})$ , then we can draw a right triangle with $\color{#d81a60}\vec e'$ as the hypotenuse and $\color{#d81a60}\vec e_\text{o}$ as one of the legs, making

\lVert {\color{#d81a60}\vec e'} \rVert^2 > \lVert {\color{#d81a60}\vec e_\text{o}} \rVert^2

This is such an important idea that I want to redraw the picture above with just the orthogonal projection. Note that I’ve replaced $\vec w_\text{o}$ with $\vec w^*$ to indicate that this is the optimal choice of $\vec w$ .

fig = create_base_fig()

# y = np.array([1, 3, 2]) * 1.5
# v1 = (1, 0, 0.25)
# v2 = np.array([2, 2, 0]) * 0.5

X = np.vstack([v1, v2]).T
w = np.linalg.inv(X.T @ X) @ X.T @ y
# w = [-1.333, 3.6667]
p = X @ w
test_vectors = [p]

# Plot vectors with errors and draw a right angle between p and e
fig = plot_vectors_with_errors(test_vectors, fig, notation='*')
fig.show(renderer='png', scale=2)

A Proof that the Orthogonal Error Vector is the Shortest¶

In office hours, a student asked for more justification that the shortest possible error vector is one that’s orthogonal to the column space of $\color{#3d81f6} X$ , especially because it’s hard to visualize what orthogonality looks like in higher dimensions. Remember, all it means for two vectors to be orthogonal is that their dot product is 0.

Given that, let’s assume only that

$\vec w_\text{o}$ is chosen so that ${\color{#d81a60} \vec e_\text{o}} = {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w_\text{o}$ is orthogonal to $\text{colsp}({\color{#3d81f6} X})$ , and
$\vec w'$ is any other choice of $\vec w$ , with a corresponding error vector ${\color{#d81a60} \vec e'} = {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w'$ .

Just with these facts alone, we can show that $\color{#d81a60} \vec e_\text{o}$ is the shortest possible error vector. To do so, let’s start by considering the (squared) magnitude of $\color{#d81a60} \vec e'$ :

\begin{align*} \lVert {\color{#d81a60} \vec e'} \rVert^2 &= \lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w' \rVert^2 \\ &= \underbrace{\lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w' + {\color{#3d81f6} X} \vec w_\text{o} - {\color{#3d81f6} X} \vec w_\text{o} \rVert^2}_{\text{seems arbitrary, but it's a legal operation that brings } \vec w_\text{o} \text{back in}} \\ &= \lVert \underbrace{{\color{orange} \vec y} - {\color{#3d81f6} X} \vec w_\text{o}}_{``a"} + \underbrace{{\color{#3d81f6} X} (\vec w_\text{o} - \vec w')}_{``b"} \rVert^2 \\ &= \underbrace{\lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w_\text{o} \rVert^2 + \lVert {\color{#3d81f6} X} (\vec w_\text{o} - \vec w') \rVert^2 + 2 ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w_\text{o}) \cdot {\color{#3d81f6} X} (\vec w_\text{o} - \vec w')}_{\lVert \vec a + \vec b \rVert^2 = \lVert \vec a \rVert^2 + \lVert \vec b \rVert^2 + 2 \vec a \cdot \vec b} \\ &= \lVert {\color{#d81a60} \vec e_\text{o}} \rVert^2 + \lVert {\color{#3d81f6} X} (\vec w_\text{o} - \vec w') \rVert^2 + \underbrace{2 {\color{#d81a60} \vec e_\text{o}} \cdot {\color{#3d81f6} X} (\vec w_\text{o} - \vec w')}_{\text{0, because } {\color{#d81a60} \vec e_\text{o}} \text{ is orthogonal to the columns of } \color{#3d81f6} X} \\ &= \lVert {\color{#d81a60} \vec e_\text{o}} \rVert^2 + \lVert {\color{#3d81f6} X} (\vec w_\text{o} - \vec w') \rVert^2 \\ \geq \lVert {\color{#d81a60} \vec e_\text{o}} \rVert^2 \end{align*}

So, no matter what choice of $\vec w'$ we make, the magnitude of $\color{#d81a60} \vec e'$ can’t be smaller than the magnitude of $\color{#d81a60} \vec e_\text{o}$ . This means that the error vector that is orthogonal to the column space of $\color{#3d81f6} X$ is the shortest possible error vector.

This is really just the same proof as in Chapter 3.4, where we argued that ${\color{#d81a60} \vec e_\text{o}}$ , ${\color{#3d81f6} X}(\vec w_\text{o} - \vec w')$ , and $\color{#d81a60} \vec e'$ form a right triangle, where $\color{#d81a60} \vec e'$ is the hypotenuse.

The Normal Equation¶

We’ve come to the conclusion that in order to find the $\vec w$ that minimizes

\lVert {\color{#d81a60} \vec e} \rVert^2 = \lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \rVert^2

we need to find the $\vec w$ that makes the error vector ${\color{#d81a60} \vec e} = {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w$ orthogonal to $\text{colsp}({\color{#3d81f6} X})$ . $\text{colsp}({\color{#3d81f6} X})$ is the set of all linear combinations of ${\color{#3d81f6} X}$ 's columns. So, if we can find an $\color{#d81a60} \vec e$ that is orthogonal to every column of ${\color{#3d81f6} X}$ , then it must be orthogonal to any of their linear combinations, too.

So, we’re looking for a ${\color{#d81a60} \vec e} = {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w$ that satisfies

\begin{align*} {\color{#3d81f6} \vec x^{(1)}} \cdot ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) &= 0 \\ {\color{#3d81f6} \vec x^{(2)}} \cdot ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) &= 0 \\ &\vdots \\ {\color{#3d81f6} \vec x^{(d)}} \cdot ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) &= 0 \end{align*}

As you might have guessed, there’s an easier way to write these $d$ equations simultaneously. Above, we’re taking the dot product of ${\color{orange} \vec y} - {\color{#3d81f6} X} \vec w$ with each of the columns of $\color{#3d81f6} X$ . We’ve learned that $A \vec v$ contains the dot products of $\vec v$ with the rows of $A$ . So how do we get the dot products of ${\color{orange} \vec y} - {\color{#3d81f6} X} \vec w$ with the columns of $\color{#3d81f6} X$ ? Transpose it!

{\color{#3d81f6} X^T}({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) = {\color{#3d81f6}\begin{bmatrix} {\color{#3d81f6} -} & {\color{#3d81f6} \vec{x}^{(1)}}^T & {\color{#3d81f6} -} \\ {\color{#3d81f6} -} & {\color{#3d81f6} \vec{x}^{(2)}}^T & {\color{#3d81f6} -} \\ & \vdots & \\ {\color{#3d81f6} -} & {\color{#3d81f6} \vec{x}^{(d)}}^T & {\color{#3d81f6} -} \end{bmatrix}} ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) = \begin{bmatrix} {\color{#3d81f6}\vec x^{(1)}} \cdot ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) \\ {\color{#3d81f6}\vec x^{(2)}} \cdot ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) \\ \vdots \\ {\color{#3d81f6}\vec x^{(d)}} \cdot ({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) \end{bmatrix}

So, if we want ${\color{orange} \vec y} - {\color{#3d81f6} X} \vec w$ to be orthogonal to each of the columns of $\color{#3d81f6} X$ , then we need ${\color{#3d81f6} X^T}({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) = \vec 0$ (note that this is the vector $\vec 0 \in \mathbb{R}^d$ , not the scalar 0.) Another way of saying this is that we need the error vector to be in the left null space of $\color{#3d81f6} X$ , i.e. ${\color{#d81a60} \vec e} \in \text{null}({\color{#3d81f6} X^T})$ .

\begin{align*} {\color{#3d81f6}X^T}{\color{#d81a60} \vec e} &= 0 \\{\color{#3d81f6} X^T}({\color{orange} \vec y} - {\color{#3d81f6} X} \vec w) &= \vec 0 \\ {\color{#3d81f6} X^T} {\color{orange} \vec y} - {\color{#3d81f6} X^T} {\color{#3d81f6} X} \vec w &= \vec 0 \\ {\color{#3d81f6} X^T} {\color{#3d81f6} X} \vec w &= {\color{#3d81f6} X^T} {\color{orange} \vec y} \end{align*}

The final equation above is called the normal equation. “Normal” means “orthogonal”. Sometimes it’s called the normal equations to reference the fact that it’s a system of $d$ equations and $d$ unknowns, where the unknowns are components of $\vec w$ ( $w_1, w_2, \ldots, w_d$ .)

Note that ${\color{#3d81f6} X^T} {\color{#3d81f6} X} \vec w = {\color{#3d81f6} X^T} {\color{orange} \vec y}$ looks a lot like ${\color{#3d81f6} X} \vec w = \color{orange} \vec y$ , with added factors of $\color{#3d81f6} X^T$ on the left. Remember that if $\color{orange} \vec y$ is in $\text{colsp}({\color{#3d81f6} X})$ , then ${\color{#3d81f6} X} \vec w = \color{orange} \vec y$ has a solution, but that’s usualy not the case, hence why we’re attempting to approximate $\color{orange} \vec y$ with a linear combination of $\color{#3d81f6} X$ ’s columns.

Is there a unique vector $\vec w$ that satisfies the normal equation? That depends on whether $\color{#3d81f6} X^TX$ is invertible. $\color{#3d81f6} X^TX$ is a $d \times d$ matrix with the same rank as the $n \times d$ matrix $\color{#3d81f6} X$ , as we proved in Chapter 5.4.

\text{rank}({\color{#3d81f6} X^TX}) = \text{rank}({\color{#3d81f6} X})

So, $\color{#3d81f6} X^TX$ is invertible if and only if $\text{rank}({\color{#3d81f6} X}) = d$ , meaning all of $\color{#3d81f6} X$ ’s columns are linearly independent. In that case, the best choice of $\vec w$ is the unique vector

\boxed{\vec w^* = ({\color{#3d81f6} X^TX})^{-1} \color{#3d81f6}X^T \color{orange} \vec y}

$\vec w^*$ has a star on it, denoting that it is the best choice of $\vec w$ . I don’t ask you to memorize much (you get to bring a notes sheet into your exams, after all), but this equation is perhaps the most important of the semester! It might even look familiar: back in the single vector case in Chapter 3.4, the optimal coefficient on $\vec x$ was $\frac{\vec x \cdot \vec y}{\vec x \cdot \vec x} = \frac{\vec x^T \vec y}{\vec x^T \vec x}$ , which looks similar to the one above. The difference is that here, $\color{#3d81f6} X^T X$ is a matrix, not a scalar. (But, if $\color{#3d81f6} X$ is just a matrix with a single column, then $\color{#3d81f6} X^T X$ is just the dot product of $\color{#3d81f6} X$ with itself, which is a scalar, and the boxed formula above reduces to the formula from Chapter 3.4.)

What if $\color{#3d81f6} X^TX$ isn’t invertible? Then, there are infinitely many $\vec w^*$ 's that satisfy the normal equation,

{\color{#3d81f6} X^T} {\color{#3d81f6} X} \vec w = {\color{#3d81f6} X^T} {\color{orange} \vec y}

It’s not immediately obvious what it means for there to be infinitely many solutions to the normal equation; I’ve dedicated a whole subsection to it below to give this idea the consideration it deserves.

In the examples that follow, we’ll look at how to find all of the solutions to this equation when there are infinitely many.

First, let’s start with a straightforward example. Let

{\color{#3d81f6} X = \begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 1 & 1 \\ 0 & -1 \end{bmatrix}}, \quad {\color{orange} \vec y = \begin{bmatrix} 1 \\ 0 \\ 4 \\ 5 \end{bmatrix}}

The vector in $\text{colsp}({\color{#3d81f6} X})$ that is closest to $\color{orange} \vec y$ is the vector ${\color{#3d81f6} X} \vec w^*$ , where $\vec w^*$ is the solution to the normal equations,

{\color{#3d81f6} X^T X} \vec w^* = {\color{#3d81f6} X^T} \color{orange} \vec y

The first step is to compute ${\color{#3d81f6} X^T X}$ , which is a $2 \times 2$ matrix of dot products of the columns of ${\color{#3d81f6} X}$ .

{\color{#3d81f6} X^T X = \begin{bmatrix} 6 & 3 \\ 3 & 3 \end{bmatrix}}

${\color{#3d81f6} X^T X}$ is invertible, so we can solve for $\vec w^*$ uniquely. Remember that in practice, we’d ask Python to solve np.linalg.solve(X.T @ X, X.T @ y), but here $\color{#3d81f6} X^T X$ is small enough that we can invert it by hand.

{\color{#3d81f6} X^T X = \begin{bmatrix} 6 & 3 \\ 3 & 3 \end{bmatrix}} \implies ({\color{#3d81f6} X^T X})^{-1} = \frac{1}{9} \begin{bmatrix} 3 & -3 \\ -3 & 6 \end{bmatrix} = \begin{bmatrix} 1/3 & -1/3 \\ -1/3 & 2/3 \end{bmatrix}

Then,

\vec w^* = ({\color{#3d81f6} X^T X})^{-1} {\color{#3d81f6} X^T} {\color{orange} \vec y} = \underbrace{\begin{bmatrix} 1/3 & -1/3 \\ -1/3 & 2/3 \end{bmatrix}}_{({\color{#3d81f6} X^T X})^{-1}} \underbrace{{\color{#3d81f6}\begin{bmatrix} 1 & 2 & 1 & 0 \\ 0 & 1 & 1 & -1 \end{bmatrix}}}_{\color{#3d81f6}X^T} {\color{orange}\begin{bmatrix} 1 \\ 0 \\ 4 \\ 5 \end{bmatrix}} = \begin{bmatrix} 2 \\ -8/3 \end{bmatrix}

The magic is in the interpretation of the numbers in $\vec w^*$ , 2 and $-\frac{8}{3}$ . These are the coefficients of the columns of ${\color{#3d81f6} X}$ in the linear combination that is closest to $\color{orange} \vec y$ . Meaning,

{\color{#3d81f6} X} \vec w^* = 2 \underbrace{\color{#3d81f6}\begin{bmatrix} 1 \\ 2 \\ 1 \\ 0 \end{bmatrix}}_{\color{#3d81f6}\vec x^{(1)}} - \frac{7}{3} \underbrace{\color{#3d81f6}\begin{bmatrix} 0 \\ 1 \\ 1 \\ -1 \end{bmatrix}}_{\color{#3d81f6}\vec x^{(2)}} = \begin{bmatrix} 2 \\ 5/3 \\ -1/3 \\ 7/3 \end{bmatrix}

is the vector in $\text{colsp}({\color{#3d81f6} X})$ that is closest to $\color{orange} \vec y$ . This vector is the orthogonal projection of $\color{orange} \vec y$ onto $\text{colsp}({\color{#3d81f6} X})$ .

Examples¶

The first example above is the most concrete. The examples that follow will build our understanding of how orthogonal projections really work.

Example: Point and Plane¶

Find the point on the plane $6x - 3y + 2z = 0$ that is closest to the point $(1, 1, 1)$ .

Solution

At first, this doesn’t seem like a projection problem, but it is. The plane

6x - 3y + 2z = 0

is a 2-dimensional subspace of $\mathbb{R}^3$ , meaning it can be described as the span of two non-collinear vectors. So, all we need to do is find some matrix $X$ with those two vectors as columns, and then use the formula we derived above to project $\vec y = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$ onto $\text{colsp}(X)$ . The fact that we said “point” instead of “vector” doesn’t change the problem: in settings like these, points and vectors are equivalent.

Two vectors that lie in the plane (but point in different directions) are $\begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 1 \\ 0 \\ -3 \end{bmatrix}$ . There’s nothing special about these two vectors, other than that they have relatively small integer components. Let’s use them as columns of $X$ :

X = \begin{bmatrix} 1 & 1 \\ 2 & 0 \\ 0 & -3 \end{bmatrix}

Now, to formulate the normal equations, $X^TX \vec w^* = X^T \vec y$ , we need to compute $X^TX$ and $X^T \vec y$ .

X^TX = \begin{bmatrix} 1 & 2 & 0 \\ 1 & 0 & -3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 2 & 0 \\ 0 & -3 \end{bmatrix} = \begin{bmatrix} 5 & 1 \\ 1 & 10 \end{bmatrix}

X^T \vec y = \begin{bmatrix} 1 & 2 & 0 \\ 1 & 0 & -3 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ -2 \end{bmatrix}

So, the normal equations are

\begin{bmatrix} 5 & 1 \\ 1 & 10 \end{bmatrix} \underbrace{\begin{bmatrix} w_0^* \\ w_1^* \end{bmatrix}}_{\vec w^*} = \begin{bmatrix} 3 \\ -2 \end{bmatrix}

We could use $\vec w^* = (X^TX)^{-1}X^T \vec y$ , but often it’s easier to solve the system directly.

\begin{align*} 5w_0^* + w_1^* &= 3 \\ w_0^* + 10w_1^* &= -2 \end{align*}

Solving this system gives us $w_0^* = \frac{32}{49}$ and $w_1^* = -\frac{13}{49}$ . (Sorry, I know the numbers aren’t pretty in this example. But that’s what happens in the real world.)

Now that we’ve solved for $\vec w^* = \begin{bmatrix} \frac{32}{49} \\ -\frac{13}{49} \end{bmatrix}$ , the projection of $\vec y$ onto $\text{colsp}(X)$ - which is the point on the plane that’s closest to $\vec y$ - is

\begin{align*} X\vec w^* &= \begin{bmatrix} 1 & 1 \\ 2 & 0 \\ 0 & -3 \end{bmatrix} \begin{bmatrix} \frac{32}{49} \\ -\frac{13}{49} \end{bmatrix} \\ &= \begin{bmatrix} \frac{32}{49} - \frac{13}{49} \\ \frac{64}{49} \\ \frac{39}{49} \end{bmatrix} \\ &= \begin{bmatrix} \frac{19}{49} \\ \frac{64}{49} \\ \frac{39}{49} \end{bmatrix} \end{align*}

So, after all that work, we can conclude that the point on the plane $6x - 3y + 2z = 0$ that is closest to $(1, 1, 1)$ is $(\frac{19}{49}, \frac{64}{49}, \frac{39}{49})$ .

Example: What if ${\color{orange} \vec y} \in \text{colsp}({\color{#3d81f6} X})$ ?¶

Find the orthogonal projection of $\color{orange} \vec y = \begin{bmatrix} 1 \\ 3 \\ 2 \\ -1 \end{bmatrix}$ onto $\text{colsp}({\color{#3d81f6} X})$ , where ${\color{#3d81f6} X = \begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 1 & 1 \\ 0 & -1 \end{bmatrix}}$ .

Solution

First, notice that $\color{orange} \vec y$ is just the sum of the two columns of $\text{colsp}({\color{#3d81f6} X})$ . So intuitively, because ${\color{orange} \vec y}$ is already in the column space of ${\color{#3d81f6} X}$ , the projection is just ${\color{orange} \vec y}$ itself.

But let’s make sure the math works out that way.

First, we solve the normal equations for $\vec w^*$ .

\begin{align*} {\color{#3d81f6} X^TX} &= {\color{#3d81f6}\begin{bmatrix} 1 & 2 & 1 & 0 \\ 0 & 1 & 1 & -1 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 1 & 1 \\ 0 & -1 \end{bmatrix}} \\&= {\color{#3d81f6}\begin{bmatrix} 1 + 4 + 1 & 2 + 1 \\ 2 + 1 & 1 + 1 + 1 \end{bmatrix}} \\&= {\color{#3d81f6}\begin{bmatrix} 6 & 3 \\ 3 & 3 \end{bmatrix}} \\ \\ {\color{#3d81f6}X^T}{\color{orange}\vec y} &= {\color{#3d81f6}\begin{bmatrix} 1 & 2 & 1 & 0 \\ 0 & 1 & 1 & -1 \end{bmatrix}}{\color{orange}\begin{bmatrix} 1 \\ 3 \\ 2 \\ -1 \end{bmatrix}} \\&= \begin{bmatrix} 9 \\ 6 \end{bmatrix} \end{align*}

So, $\vec w^*$ is the solution to

\underbrace{{\color{#3d81f6} \begin{bmatrix} 6 & 3 \\ 3 & 3 \end{bmatrix}}}_{\color{#3d81f6}X^TX}\vec w^* = \underbrace{\begin{bmatrix} 9 \\ 6 \end{bmatrix}}_{\color{#3d81f6}X^T\color{orange}\vec y}

and here, it’s pretty clear that $\vec w^* = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ . (Again, we could go through the hassle of inverting $X^TX$ to get the same answer, but there’s no need to.)

This means that the projection of $\color{orange} \vec y$ onto $\text{colsp}({\color{#3d81f6} X})$ is

{\color{#3d81f6}X}\vec w^* = \begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 1 & 1 \\ 0 & -1 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 3 \\ 2 \\ -1 \end{bmatrix} = \color{orange}\vec y

as we predicted from the start.

Example: Orthogonality with the Columns of $X$ ¶

Let $\color{#3d81f6} X = \begin{bmatrix} 1 & 2 \\ 3 & 2 \\ 0 & 2 \\ 1 & 2 \end{bmatrix}$ , $\color{#004d40} Z = \begin{bmatrix} 1 & 0 \\ 0 & -1 \\ 0 & 1 \\ 3 & 0 \end{bmatrix}$ , and $\color{orange} \vec y$ be any vector in $\mathbb{R}^4$ .

Let $\vec p_X$ and $\vec p_Z$ be the orthogonal projections of $\color{orange} \vec y$ onto $\text{colsp}({\color{#3d81f6} X})$ and $\text{colsp}({\color{#004d40} Z})$ , respectively.

Explain why it is guaranteed that the components of the vector

{\color{#d81a60} \vec e_X} = {\color{orange} \vec y} - \vec p_X

sum to zero, but the components of the vector ${\color{#d81a60} \vec e_Z} = {\color{orange} \vec y} - \vec p_Z$ do not necessarily.

Solution

The error vectors ${\color{#d81a60} \vec e_X}$ and ${\color{#d81a60} \vec e_Z}$ are orthogonal to the column spaces of their respective matrices. In other words,

{\color{#3d81f6} X^T}{\color{#d81a60} \vec e_X}=\vec 0, \qquad {\color{#004d40} Z^T}{\color{#d81a60} \vec e_Z}=\vec 0

Suppose $\color{#d81a60} \vec e_X$ ’s components are $e_1, e_2, e_3, e_4$ . In the case of $\color{#3d81f6} X$ , we have that

\underbrace{\begin{bmatrix} 1 & 3 & 0 & 1 \\ 2 & 2 & 2 & 2 \end{bmatrix}}_{\color{#3d81f6} X^T} \underbrace{\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{bmatrix}}_{\color{#d81a60} \vec e_X} = \begin{bmatrix} e_1 + 3e_2 + e_4 \\ 2e_1 + 2e_2 + 2e_3 + 2e_4 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}

The second component of the above equation implies that $2e_1 + 2e_2 + 2e_3 + 2e_4 = 0$ , so $e_1 + e_2 + e_3 + e_4 = 0$ , meaning the components of $\color{#d81a60} \vec e_X$ sum to 0.

In the case of $\color{#004d40} Z$ , we don’t have that same assurance.

\underbrace{\begin{bmatrix} 1 & 0 & 0 & 3 \\ 0 & -1 & 1 & 0\end{bmatrix}}_{\color{#004d40} Z^T} \underbrace{\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{bmatrix}}_{\color{#d81a60} \vec e_Z} = \begin{bmatrix} e_1 + 3e_4 \\ -e_2 + e_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}

The condition ${\color{#004d40} Z^T}{\color{#d81a60} \vec e_Z} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ gives us some relationships involving $e_1, e_2, e_3, e_4$ , but not enough to guarantee that the components sum to 0.

Key takeaway: Remember that $\color{#3d81f6} X^T \color{#d81a60} \vec e_X = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ implies that $\color{#d81a60} \vec e_X$ is orthogonal to any linear combination of the columns of $\color{#3d81f6} X$ . So, if you can create a column of all 1’s using a linear combination of $\color{#3d81f6} X$ ’s columns, then the components of $\color{#d81a60} \vec e_X$ will sum to 0, no matter which vector $\color{orange} \vec y$ you choose to project onto $\text{colsp}({\color{#3d81f6} X})$ .

This is one of the most important consequences of orthogonal projections, especially as it relates to linear regression.

Example: $X$ with Orthogonal Columns¶

Let $\color{#3d81f6} X = \begin{bmatrix} 3/13 & -4/5 \\ 4/13 & 3/5 \\ 12/13 & 0 \end{bmatrix}$ and $\color{orange} \vec y = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}$ .

Find the value of $\vec w^*$ that minimizes $\lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \rVert^2$ . What about $\color{#3d81f6} X$ makes this easier than in other examples?

Solution

Notice that both of $\color{#3d81f6} X$ ’s columns are orthogonal to each other, and are unit vectors, i.e. they are orthonormal. This means that

\color{#3d81f6} X^T \color{#3d81f6} X = \begin{bmatrix} 3/13 & 4/13 & 5/13 \\ -4/5 & 3/5 & 0 \end{bmatrix} \begin{bmatrix} 3/13 & -4/5 \\ 4/13 & 3/5 \\ 5/13 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}

is the identity matrix, so

\vec w^* = \underbrace{({\color{#3d81f6} X^T \color{#3d81f6} X})^{-1}}_{I} {\color{#3d81f6} X^T} {\color{orange} \vec y} = {\color{#3d81f6} X^T} {\color{orange} \vec y} = \begin{bmatrix} 15/13 \\ -4/5 \end{bmatrix}

Usually, our data doesn’t come to us with orthogonal columns. But, using the Gram-Schmidt process introduced to you in Homework 7, you can convert a set of linearly independent vectors into an orthogonal set with the same span, allowing you to leverage that $Q^TQ = I$ and simplify the projection process.

Key takeaway: Orthonormal vectors are very easy to work with!

Example: Why Can’t We Separate?¶

Why can’t we separate

\vec w^* = ({\color{#3d81f6} X^T X})^{-1} {\color{#3d81f6} X^T} \color{orange} \vec y

into

\vec w^* = {\color{#3d81f6} X^{-1}} \underbrace{({\color{#3d81f6} X^T})^{-1} {\color{#3d81f6} X^T}}_{I} {\color{orange} \vec y} = {\color{#3d81f6} X^{-1}} \color{orange} \vec y

in general?

Solution

In general, $\color{#3d81f6} X$ is not a square matrix, so it can’t be invertible!

If $\color{#3d81f6} X$ is square and invertible, then the steps above are valid. But, if $\color{#3d81f6} X$ is a square and invertible $n \times n$ matrix, then $\text{colsp}({\color{#3d81f6} X}) = \mathbb{R}^n$ , meaning that any $\color{orange} \vec y$ in $\mathbb{R}^n$ can be written as a linear combination of $\color{#3d81f6} X$ ’s columns, meaning that $\color{orange} \vec y$ is already in $\text{colsp}({\color{#3d81f6} X})$ , and there is no projection error (or need to do a projection in the first place).

Next, we handle the case where the columns are linearly dependent and develop the complete solution.

6.3. Projecting onto the Column Space

Approximating using a Single Vector¶

Approximating using Multiple Vectors¶

A Proof that the Orthogonal Error Vector is the Shortest¶

The Normal Equation¶

Examples¶

Example: Point and Plane¶

Example: What if y⃗∈colsp(X){\color{orange} \vec y} \in \text{colsp}({\color{#3d81f6} X})y​∈colsp(X)?¶

Example: Orthogonality with the Columns of XXX¶

Example: XXX with Orthogonal Columns¶

Example: Why Can’t We Separate?¶

Example: What if ${\color{orange} \vec y} \in \text{colsp}({\color{#3d81f6} X})$ ?¶

Example: Orthogonality with the Columns of $X$ ¶

Example: $X$ with Orthogonal Columns¶