
2.3. Finding Optimal Parameters

Finding the Partial Derivatives

Let’s return to the simple linear regression problem. Recall, the function we’re trying to minimize is:

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

Why? By minimizing $R_\text{sq}(w_0, w_1)$, we're finding the intercept ($w_0^*$) and slope ($w_1^*$) of the line that best fits the data. Don't forget that this goal is the point of all of these mathematical ideas.
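As a quick aside, here's a minimal sketch of how $R_\text{sq}$ itself could be computed in code, assuming x and y are NumPy arrays containing the dataset (the function name is my own, not from the text):

import numpy as np

def mean_squared_error(w0, w1, x, y):
    # Average of the squared differences between actual and predicted values.
    return np.mean((y - (w0 + w1 * x)) ** 2)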

Using partial derivatives from the detour, we now minimize mean squared error to find the optimal parameters.


We've learned that to minimize $R_\text{sq}(w_0, w_1)$, we'll need to find both of its partial derivatives, and solve for the point $(w_0^*, w_1^*, R_\text{sq}(w_0^*, w_1^*))$ at which they're both 0.

Let's start with the partial derivative with respect to $w_0$:

$$
\begin{aligned}
R_\text{sq}(w_0, w_1) &= \frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2 \\
\frac{\partial R_{\text{sq}}}{\partial w_0} &= \frac{\partial}{\partial w_0} \left[ \frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2 \right] \\
&= \frac{1}{n} \sum_{i = 1}^n \frac{\partial}{\partial w_0} \left( y_i-(w_0+w_1 x_i) \right)^2 \\
&= \frac{1}{n} \sum_{i = 1}^n 2( y_i-(w_0+w_1 x_i) ) \cdot \underbrace{\frac{\partial}{\partial w_0}( y_i-(w_0+w_1 x_i) )}_\text{chain rule} \\
&= \frac{1}{n} \sum_{i = 1}^n 2( y_i-(w_0+w_1 x_i) ) \cdot (-1) \\
&= -\frac{2}{n} \sum_{i = 1}^n ( y_i-(w_0+w_1 x_i) )
\end{aligned}
$$

Onto $w_1$:

$$
\begin{aligned}
R_\text{sq}(w_0, w_1) &= \frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2 \\
\frac{\partial R_{\text{sq}}}{\partial w_1} &= \frac{\partial}{\partial w_1} \left[ \frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2 \right] \\
&= \frac{1}{n} \sum_{i = 1}^n \frac{\partial}{\partial w_1} \left( y_i-(w_0+w_1 x_i) \right)^2 \\
&= \frac{1}{n} \sum_{i = 1}^n 2( y_i-(w_0+w_1 x_i) ) \cdot \underbrace{\frac{\partial}{\partial w_1}( y_i-(w_0+w_1 x_i) )}_\text{chain rule} \\
&= \frac{1}{n} \sum_{i = 1}^n 2( y_i-(w_0+w_1 x_i) ) \cdot (-x_i) \\
&= -\frac{2}{n} \sum_{i = 1}^n x_i ( y_i-(w_0+w_1 x_i) )
\end{aligned}
$$

All in one place now:

$$
\begin{aligned}
\frac{\partial R_{\text{sq}}}{\partial w_0} &= -\frac{2}{n} \sum_{i = 1}^n ( y_i-(w_0+w_1 x_i) ) \\
\frac{\partial R_{\text{sq}}}{\partial w_1} &= -\frac{2}{n} \sum_{i = 1}^n x_i ( y_i-(w_0+w_1 x_i) )
\end{aligned}
$$

These look very similar – it's just that $\frac{\partial R_{\text{sq}}}{\partial w_1}$ has an added $x_i$ in the summation.

Remember, both partial derivatives are functions of two variables: $w_0$ and $w_1$. We're treating the $x_i$'s and $y_i$'s as constants. If I already have a dataset, you can pick an intercept $w_0$ and slope $w_1$ and I can use these formulas to compute the partial derivatives of $R_\text{sq}$ for that combination of intercept and slope.

In case it helps you put things in perspective, here’s how I might implement these formulas in code, assuming that x and y are arrays:

import numpy as np

# Assume x and y are NumPy arrays defined somewhere above these functions.
def partial_R_w0(w0, w1):
    # Sub-optimal technique, since it uses a for-loop.
    total = 0
    for i in range(len(x)):
        total += y[i] - (w0 + w1 * x[i])
    return -2 * total / len(x)
    # Returns a single number!

def partial_R_w1(w0, w1):
    # Better technique, as it uses vectorized operations.
    return -2 * np.mean(x * (y - (w0 + w1 * x)))
    # Also returns a single number!
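For comparison, here's a vectorized version of the first function as well – a sketch that isn't part of the original code, but that should return the same value as the loop-based partial_R_w0:

def partial_R_w0_vectorized(w0, w1):
    # Same formula as partial_R_w0, written with vectorized operations instead of a loop.
    return -2 * np.mean(y - (w0 + w1 * x))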

Before we solve for where both $\frac{\partial R_{\text{sq}}}{\partial w_0}$ and $\frac{\partial R_{\text{sq}}}{\partial w_1}$ are 0, let's visualize them in the context of our loss surface.

[Interactive visualization: the loss surface $R_\text{sq}(w_0, w_1)$, with sliders for $w_0$ and $w_1$.]

Click “Slider for values of $w_0$”. No matter where you drag that slider, the resulting gold curve is a function of $w_1$ only. Every gold curve you see when dragging the $w_0$ slider will have a minimum at some value of $w_1$.

Then, click “Slider for values of $w_1$”. No matter where you drag that slider, the resulting gold curve is a function of $w_0$ only, and has some minimum value.

But there is only one combination of $w_0$ and $w_1$ where the gold curves have minimums at the exact same intersecting point. That is the combination of $w_0$ and $w_1$ that minimizes $R_\text{sq}$, and it's what we're searching for.
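If you'd like to recreate a static version of this loss surface yourself, here's a minimal sketch using matplotlib. The grid ranges below are hypothetical and should be adjusted to your data; x and y are assumed to be the arrays from earlier.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical ranges for the intercept and slope; tweak them for your dataset.
w0_values = np.linspace(50, 200, 100)
w1_values = np.linspace(-20, 5, 100)
w0_grid, w1_grid = np.meshgrid(w0_values, w1_values)

# Mean squared error at every (w0, w1) combination on the grid.
R = np.mean((y[:, None, None] - (w0_grid + w1_grid * x[:, None, None])) ** 2, axis=0)

plt.contourf(w0_grid, w1_grid, R, levels=30)
plt.colorbar(label=r'$R_\mathrm{sq}(w_0, w_1)$')
plt.xlabel(r'$w_0$ (intercept)')
plt.ylabel(r'$w_1$ (slope)')
plt.show()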

Solving for the Optimal Parameters

Now, it's time to analytically (that is, on paper) find the values of $w_0^*$ and $w_1^*$ that minimize $R_\text{sq}$. We'll do so by solving the following system of two equations and two unknowns:

$$
\begin{aligned}
\frac{\partial R_\text{sq}}{\partial w_0} &= -\frac{2}{n} \sum_{i = 1}^n ( y_i-(w_0+w_1 x_i) ) = 0 \\
\frac{\partial R_\text{sq}}{\partial w_1} &= -\frac{2}{n} \sum_{i = 1}^n x_i ( y_i-(w_0+w_1 x_i) ) = 0
\end{aligned}
$$

Here’s my plan:

  1. In the first equation, try and isolate $w_0$; this value will be called $w_0^*$.

  2. Plug the expression for $w_0^*$ into the second equation to solve for $w_1^*$.

Let’s start with the first step.

$$-\frac{2}{n} \sum_{i = 1}^n ( y_i-(w_0+w_1 x_i) ) = 0$$

Multiplying both sides by $-\frac{n}{2}$ gives us:

$$\sum_{i = 1}^n ( \underbrace{y_i}_\text{actual} - \underbrace{(w_0+w_1 x_i)}_\text{predicted} ) = 0$$

Before I continue, I want to highlight that this itself is an important balance condition, much like those we discussed in Chapter 1.3. It's saying that the sum of the errors of the optimal line's predictions – that is, the line with intercept $w_0^*$ and slope $w_1^*$ – is 0.

Let’s continue with the first step – I’ll try and keep the commentary to a minimum. It’s important to try and replicate these steps yourself, on paper.

$$
\begin{aligned}
\sum_{i = 1}^n ( y_i-(w_0+w_1 x_i) ) &= 0 \\
\sum_{i = 1}^n ( y_i-w_0-w_1 x_i ) &= 0 \\
\sum_{i = 1}^n y_i - \sum_{i = 1}^n w_0 - \sum_{i = 1}^n w_1 x_i &= 0 \\
\sum_{i = 1}^n y_i - nw_0 - w_1\sum_{i = 1}^n x_i &= 0 \\
\sum_{i = 1}^n y_i - w_1\sum_{i = 1}^n x_i &= nw_0 \\
\frac{\sum_{i = 1}^n y_i}{n} - w_1\frac{\sum_{i = 1}^n x_i}{n} &= w_0 \\
w_0^* &= \bar{y}-w_1^* \bar{x}
\end{aligned}
$$

Awesome! We're halfway there. We have a formula for the optimal intercept, $w_0^*$, in terms of the optimal slope, $w_1^*$. Let's use $w_0^* = \bar{y}-w_1^* \bar{x}$ and see where it gets us in the second equation.

$$
\begin{aligned}
-\frac{2}{n} \sum_{i = 1}^n x_i ( y_i-(w_0+w_1 x_i) ) &= 0 \\
\sum_{i = 1}^n x_i ( y_i-(w_0+w_1 x_i) ) &= 0 \\
\sum_{i = 1}^n x_i ( y_i-(\underbrace{\bar{y}-w_1^* \bar{x}}_{w_0^*}+w_1^* x_i) ) &= 0 \\
\sum_{i = 1}^n x_i ( \underbrace{y_i-\bar{y}+w_1^* \bar{x}-w_1^* x_i}_\text{distribute negation} ) &= 0 \\
\sum_{i = 1}^n x_i \left( (y_i-\bar{y})-w_1^* ( x_i - \bar{x}) \right) &= 0 \\
\underbrace{\sum_{i = 1}^n x_i (y_i-\bar{y}) - w_1^* \sum_{i=1}^n x_i ( x_i - \bar{x})}_\text{expand summation} &= 0 \\
\sum_{i = 1}^n x_i (y_i-\bar{y}) &= w_1^* \sum_{i=1}^n x_i ( x_i - \bar{x}) \\
w_1^* &= \frac{\sum_{i = 1}^n x_i (y_i-\bar{y})}{\sum_{i=1}^n x_i ( x_i - \bar{x})}
\end{aligned}
$$

Rewriting and Using the Formulas

We're done! We have formulas for the optimal slope and intercept. But, before we celebrate, I'm going to try and rewrite $w_1^*$ in an equivalent, more symmetrical form that is easier to interpret.

Claim:

$$w_1^* = \underbrace{\frac{\sum_{i = 1}^n x_i (y_i-\bar{y})}{\sum_{i=1}^n x_i ( x_i - \bar{x})}}_\text{formula we derived above} = \underbrace{\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}}_\text{nicer looking formula}$$
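Here's one way to see why the two forms agree. The key fact is that deviations from the mean sum to 0, i.e. $\sum_{i=1}^n (y_i - \bar{y}) = 0$ and $\sum_{i=1}^n (x_i - \bar{x}) = 0$. For the numerator:

$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n x_i (y_i - \bar{y}) - \bar{x} \underbrace{\sum_{i=1}^n (y_i - \bar{y})}_{=\, 0} = \sum_{i=1}^n x_i (y_i - \bar{y})$$

The same argument applied to the denominator shows that $\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i (x_i - \bar{x})$, so the two fractions are equal.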

This is not the only equivalent formula for the slope; for instance, $w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})y_i}{\sum_{i=1}^n (x_i - \bar{x})^2}$ works too, and you can verify this using the same logic as in the proof above.

To summarize, the parameters that minimize mean squared error for the simple linear regression model, $h(x_i) = w_0 + w_1 x_i$, are:

$$\boxed{w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad w_0^* = \bar{y} - w_1^* \bar{x}}$$

This is an important result, and you should remember it. There are a lot of symbols above, but just note that given a dataset $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, you could apply the formulas above by hand to find the optimal parameters yourself.
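For instance, take the small hypothetical dataset $(1, 2), (2, 4), (3, 3)$, so that $\bar{x} = 2$ and $\bar{y} = 3$. Then:

$$w_1^* = \frac{(1-2)(2-3) + (2-2)(4-3) + (3-2)(3-3)}{(1-2)^2 + (2-2)^2 + (3-2)^2} = \frac{1}{2}, \qquad w_0^* = 3 - \frac{1}{2} \cdot 2 = 2$$

so the line that minimizes mean squared error on this tiny dataset is $h(x_i) = 2 + \frac{1}{2} x_i$.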

What does this line look like on the commute times data?

[Figure: the commute times data with the line that minimizes mean squared error.]

The line above goes by many names:

  • The simple linear regression line that minimizes mean squared error.

  • The simple linear regression line (if said without context).

  • The regression line.

  • The least squares regression line (because it has the least mean squared error).

  • The line of best fit.

Whatever you’d like to call it, now that we’ve found our optimal parameters, we can use them to make predictions.

$$h(x_i) = w_0^* + w_1^* x_i$$

On the dataset of commute times:

# Assume x is an array with departure hours and y is an array with commute times.
w1_star = np.sum((x - np.mean(x)) * (y - np.mean(y))) / np.sum((x - np.mean(x)) ** 2)
w0_star = np.mean(y) - w1_star * np.mean(x)

w0_star, w1_star
(142.4482415877287, -8.186941724265552)
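As a quick sanity check – assuming the partial_R_w0 and partial_R_w1 functions from earlier were defined with these same x and y arrays – both partial derivatives should evaluate to approximately 0 at these parameters, and the residuals should sum to approximately 0, matching the balance condition we derived:

# Both values should be approximately 0, up to floating-point error.
partial_R_w0(w0_star, w1_star), partial_R_w1(w0_star, w1_star)

# The sum of the optimal line's errors should also be approximately 0.
np.sum(y - (w0_star + w1_star * x))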

So, our specific fit, or trained, hypothesis function is:

$$
\begin{aligned}
\text{predicted commute time}_i &= h(\text{departure hour}_i) \\
&= 142.45 - 8.19 \cdot \text{departure hour}_i
\end{aligned}
$$

This trained hypothesis function is not saying that leaving later causes you to have shorter commutes. Rather, that’s just the best linear pattern it observed in the data for the purposes of minimizing mean squared error. In reality, there are other factors that affect commute times, and we haven’t performed a thorough-enough analysis to say anything about the causal relationship between departure time and commute time.

To predict how long it might take to get to school tomorrow, plug in the time you'd like to leave for $\text{departure hour}_i$ and out will come your prediction. The slope, -8.19, is in units of $\frac{\text{units of } y}{\text{units of } x} = \frac{\text{minutes}}{\text{hour}}$, and is telling us that for every hour later you leave, your predicted commute time decreases by 8.19 minutes.

In Python, I can define a predict function as follows:

def predict(x_new):
    return w0_star + w1_star * x_new

# Predicted commute time if I leave at 8:30AM.
predict(8.5)
72.8592369314715
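For example, comparing a hypothetical 9:30AM departure to the 8:30AM one above, the prediction changes by exactly the slope:

# Leaving one hour later decreases the predicted commute time by about 8.19 minutes.
predict(9.5) - predict(8.5)   # Approximately -8.19, i.e. w1_star.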

Regression Line Passes Through the Mean

There's an important property that the regression line satisfies: for any dataset, the line that minimizes mean squared error passes through the point $(\text{mean of } x, \text{mean of } y)$.

[Figure: the regression line for the commute times data, passing through the point $(\bar{x}, \bar{y})$.]
predict(np.mean(x))
73.18461538461538
# Same!
np.mean(y)
73.18461538461538

Our commute times regression line passes through the point $(\bar{x}, \bar{y})$, even if that was not necessarily one of the original points in the dataset.

Intuitively, this says that for an average input, the line that minimizes mean squared error will always predict an average output.

Why is this fact true? See if you can reason about it yourself, then check the solution once you’ve attempted it.
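(One way to see it, using the formula $w_0^* = \bar{y} - w_1^* \bar{x}$ derived above: plugging $\bar{x}$ into the fitted line gives $h(\bar{x}) = w_0^* + w_1^* \bar{x} = (\bar{y} - w_1^* \bar{x}) + w_1^* \bar{x} = \bar{y}$.)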

The Modeling Recipe

To conclude, let’s run through the three-step modeling recipe.

1. Choose a model.

$$h(x_i) = w_0 + w_1 x_i$$

2. Choose a loss function.

We chose squared loss:

$$L_\text{sq}(y_i, h(x_i)) = (y_i - h(x_i))^2$$

3. Minimize average loss to find optimal parameters.

For the simple linear regression model, empirical risk is:

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

We showed that:

$$w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad w_0^* = \bar{y} - w_1^* \bar{x}$$

While the process of minimizing $R_\text{sq}$ was much, much more complex than in the case of our single parameter model, the conceptual backing of the process was still this three-step recipe, and hopefully now you see its value.

Having derived the optimal line, we next quantify how well a line fits data using correlation.