
1.4. Simple Linear Regression

Introduction

The time has finally come: let’s apply what we’ve learned about loss functions and the modeling recipe to “upgrade” from the constant model to the simple linear regression model.

To recap, our goal is to find a hypothesis function $h$ such that:

$$\text{predicted commute time}_i = h(\text{departure hour}_i)$$
Image produced in Jupyter

So far, we’ve studied the constant model, where the hypothesis function is a horizontal line:

$$h(x_i) = w$$

The sole parameter, $w$, controlled the height of the line. Up until now, "parameter" and "prediction" were interchangeable terms, because our sole parameter $w$ controlled what our constant prediction was.

Now, the simple linear regression model has two parameters:

$$h(x_i) = w_0 + w_1 x_i$$

$w_0$ controls the intercept of the line, and $w_1$ controls its slope. No longer is it the case that "parameter" and "prediction" are interchangeable terms, because $w_0$ and $w_1$ control different aspects of the prediction-making process.

How do we find the optimal parameters, $w_0^*$ and $w_1^*$? Different values of $w_0$ and $w_1$ give us different lines, each of which fits the data with varying degrees of accuracy.

Image produced in Jupyter

To make things precise, let’s turn to the three-step modeling recipe from Chapter 1.3.

1. Choose a model.

$$h(x_i) = w_0 + w_1 x_i$$

2. Choose a loss function.

We’ll stick with squared loss:

$$L_\text{sq}(y_i, h(x_i)) = (y_i - h(x_i))^2$$

3. Minimize average loss to find optimal parameters.

Mean squared error, for any hypothesis function $h$, takes the form:

$$\frac{1}{n} \sum_{i=1}^n (y_i - h(x_i))^2$$

For the simple linear regression model, this becomes:

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

Now, we need to find the values of $w_0$ and $w_1$ that together minimize $R_\text{sq}(w_0, w_1)$. But what does that even mean?

In the case of the constant model and squared loss, where we had to minimize $R_\text{sq}(w) = \frac{1}{n} \sum_{i=1}^n (y_i - w)^2$, we did so by taking the derivative with respect to $w$ and setting it to 0.

Image produced in Jupyter

$R_\text{sq}(w)$ was a function with just a single input variable ($w$), so the problem of minimizing $R_\text{sq}(w)$ was straightforward, and resembled problems we solved in Calculus 1.

The function $R_\text{sq}(w_0, w_1)$ we're minimizing now has two input variables, $w_0$ and $w_1$. In mathematics, sometimes we'll write $R_\text{sq}: \mathbb{R}^2 \to \mathbb{R}$ to say that $R_\text{sq}$ is a function that takes in two real numbers and returns a single real number.

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

Remember, we should treat the $x_i$'s and $y_i$'s as constants, as these are known quantities once we're given a dataset.

What does $R_\text{sq}(w_0, w_1)$ even look like? We need three dimensions to visualize it – one axis for $w_0$, one for $w_1$, and one for the output, $R_\text{sq}(w_0, w_1)$.

Interactive figure produced in Jupyter: the loss surface of $R_\text{sq}(w_0, w_1)$ for the commute times dataset.

The graph above is called a loss surface, even though it’s a graph of empirical risk, i.e. average loss, not the loss for a single data point. The plot is interactive, so you should drag it around to get a sense of what it looks like. It looks like a parabola with added depth, similar to how cubes look like squares with added depth. Lighter regions above correspond to low mean squared error, and darker regions correspond to high mean squared error.

Think of the “floor” of the graph – in other words, the $w_0$-$w_1$ plane – as the set of all possible combinations of intercept and slope. The height of the surface at any point $(w_0, w_1)$ is the mean squared error of the hypothesis $h(x_i) = w_0 + w_1 x_i$ on the commute times dataset.

Our goal is to find the combination of $w_0$ and $w_1$ that gets us to the bottom of the surface, marked by the gold point in the plot. Somehow, this will involve calculus and derivatives, but we'll need to extend our single variable approach.
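
If you'd like to get a computational feel for this surface before we bring in any calculus, here's a minimal sketch that evaluates mean squared error over a grid of candidate intercepts and slopes, assuming x and y are the arrays of departure hours and commute times (the grid ranges below are made up for illustration):

import numpy as np

def mse(w0, w1):
    # Mean squared error of the line with intercept w0 and slope w1.
    return np.mean((y - (w0 + w1 * x)) ** 2)

# Candidate intercepts and slopes to try (ranges chosen arbitrarily).
w0_grid = np.linspace(50, 250, 200)
w1_grid = np.linspace(-20, 5, 200)
surface = np.array([[mse(w0, w1) for w1 in w1_grid] for w0 in w0_grid])

# The grid cell with the smallest MSE approximates the bottom of the surface.
i, j = np.unravel_index(np.argmin(surface), surface.shape)
w0_grid[i], w1_grid[j]

A fine enough grid gets close to the gold point, but it's a brute-force approach – the rest of this section develops the exact answer.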


Functions of Multiple Variables

Partial Derivatives

How do we take the derivative of a function with multiple input variables?

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

To illustrate, let’s focus on a simpler function with two input variables:

$$f(x,y) = \frac{x^2 + y^2}{9}$$

This is a quadratic function of two variables, and its graph is known as a paraboloid.

Interactive figure produced in Jupyter: the paraboloid $f(x, y) = \frac{x^2 + y^2}{9}$.

In the single-input case – i.e., for functions of the form $f: \mathbb{R} \to \mathbb{R}$ – the derivative $\frac{\text{d}}{\text{d}x}f(x)$ captured $f(x)$'s rate of change along the $x$-axis, which was the only axis of motion.

The function $f(x, y)$ has two input variables, and so there are two directions along which we can move. As such, we need two "derivatives" to describe the rate of change of $f(x, y)$ – one for the $x$-axis and one for the $y$-axis. Think of this as a science experiment, where we need control variables to isolate changes to a single variable. Our solution to this dilemma comes in the form of partial derivatives.

If $f$ has $n$ input variables, it has $n$ partial derivatives, one for each axis. The function $f(x, y) = \frac{x^2 + y^2}{9}$ has two partial derivatives, $\frac{\partial f}{\partial x}(x, y)$ and $\frac{\partial f}{\partial y}(x, y)$. (The symbol you're seeing, $\partial$, is a stylized, curly letter "d", often read aloud as "partial", and is used specifically for partial derivatives.)

Let me show you how to compute partial derivatives before we visualize them. We'll start with $\frac{\partial f}{\partial x}(x, y)$.

$$
\begin{align*}
f(x,y) &= \frac{x^2 + y^2}{9} \\
\frac{\partial f}{\partial x}(x, y) &= \frac{\partial}{\partial x}\left(\frac{x^2+y^2}{9}\right) \\
&= \frac{1}{9}\,\frac{\partial}{\partial x}(x^2+y^2) \\
&= \frac{1}{9}\left(\frac{\partial}{\partial x}x^2 + \underbrace{\frac{\partial}{\partial x}y^2}_{=0}\right) \\
&= \frac{1}{9}\,(2x+0) \\
&= \frac{2x}{9}
\end{align*}
$$

The result, $\frac{\partial f}{\partial x}(x, y) = \frac{2x}{9}$, is a function of $x$ and $y$. It tells us the rate of change of $f(x,y)$ along the $x$-axis, at any point $(x, y)$. It just so happens that this function doesn't involve $y$ since we chose a relatively simple function $f$, but we'll see more sophisticated examples soon.

Following similar steps, you'll see that $\frac{\partial f}{\partial y}(x, y) = \frac{2y}{9}$. This gives us:

$$\frac{\partial f}{\partial x}(x, y) = \frac{2x}{9}, \quad \frac{\partial f}{\partial y}(x, y) = \frac{2y}{9}$$

Let’s pick an arbitrary point and see what the partial derivatives tell us about it. Consider, say, $(-3, 0.5)$:

  • $\frac{\partial f}{\partial x}(-3, 0.5) = \frac{2(-3)}{9} = -\frac{2}{3}$, so if we hold $y$ constant, $f$ decreases as $x$ increases.

  • $\frac{\partial f}{\partial y}(-3, 0.5) = \frac{2(0.5)}{9} = \frac{1}{9}$, so if we hold $x$ constant, $f$ increases as $y$ increases.

Interactive figure produced in Jupyter: tangent lines to $f$ in the $x$ and $y$ directions at the point $(-3, 0.5)$.

Above, we’ve shown the tangent lines in both the $x$ and $y$ directions at the point $(-3, 0.5)$. After all, the derivative of a function at a point tells us the slope of the tangent line at that point; that interpretation remains true with partial derivatives.
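
If you'd like to double-check these partial derivatives and the numbers above symbolically, here's a quick sketch using the sympy library (which isn't used elsewhere in this chapter, so treat this as an optional aside):

import sympy as sp

x, y = sp.symbols("x y")  # Standalone symbols; not the data arrays from earlier.
f = (x**2 + y**2) / 9

df_dx = sp.diff(f, x)  # Should be 2*x/9.
df_dy = sp.diff(f, y)  # Should be 2*y/9.

# Evaluate both partial derivatives at the point (-3, 0.5).
df_dx.subs({x: -3, y: sp.Rational(1, 2)}), df_dy.subs({x: -3, y: sp.Rational(1, 2)})
# Expect -2/3 and 1/9, matching the bullets above.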

Let’s look at a more complex example. Consider:

$$g(x, y) = x^3 - 3xy^2 + 2 \sin(x) \cos(y)$$
Interactive figure produced in Jupyter: the surface $g(x, y)$.

Both partial derivatives are functions of both $x$ and $y$, which is typically what we'll see.

$$
\begin{align*}
g(x, y) &= x^3 - 3xy^2 + 2 \sin(x) \cos(y) \\
\frac{\partial g}{\partial x}(x, y) &= 3x^2 - 3y^2 + 2 \cos(x) \cos(y) \\
\frac{\partial g}{\partial y}(x, y) &= -6xy - 2 \sin(x) \sin(y)
\end{align*}
$$

To compute $\frac{\partial g}{\partial x}(x, y)$, we treated $y$ as a constant. Let me try and make more sense of this.

To help visualize, we’ve drawn the function $g(x, y)$, along with the plane $y = a$. The slider lets you change the value of $a$ being considered, i.e., it lets you change the constant value that we’re assigning to $y$.

The intersection of $g(x, y)$ and $y = a$ is marked as a gold curve and is a function of $x$ alone.

Interactive figure produced in Jupyter: $g(x, y)$ intersected with the plane $y = a$, with a slider for $a$.

Drag the slider to $y = 1.40$, for example, and look at the gold curve that results. The expression below tells you the derivative of that gold curve with respect to $x$.

$$\frac{\partial g}{\partial x}(x, 1.40) = 3x^2 - 3(1.40)^2 + 2 \cos(x) \cos(1.40) = \underbrace{3x^2 + 0.34 \cos(x) - 5.88}_{\text{derivative of the gold curve w.r.t. } x}$$

Thinking in three dimensions can be difficult, so don’t fret if you’re confused as to what all of these symbols mean – this is all a bit confusing to me too. (Are professors allowed to say this?) Nonetheless, I hope these interactive visualizations are helping you make some sense of the formulas, and if there’s anything I can do to make them clearer, please do tell me!

Optimization

To minimize (or maximize) a function $f: \mathbb{R} \to \mathbb{R}$, we solved for critical points, which were points where the (single variable) derivative was 0, and used the second derivative test to classify them as minima or maxima (or neither, as in the case of $f(x) = x^3$ at $x = 0$).

The analog in the $\mathbb{R}^2 \rightarrow \mathbb{R}$ case is solving for the points where both partial derivatives are 0, which corresponds to the points where the function is neither increasing nor decreasing along either axis.

In the case of our first example,

$$f(x, y) = \frac{x^2 + y^2}{9}$$

the partial derivatives were relatively simple,

$$\frac{\partial f}{\partial x} = \frac{2x}{9}, \quad \frac{\partial f}{\partial y} = \frac{2y}{9}$$

and both are 0 when $x = y = 0$. So, $(0, 0, f(0, 0))$ is a critical point, and we can see visually that it's a global minimum.

(Notice that above I wrote $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ instead of $\frac{\partial f}{\partial x}(x, y)$ and $\frac{\partial f}{\partial y}(x, y)$ to save space, but don't forget that both partial derivatives are functions of both $x$ and $y$ in general.)
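
To connect this to code: here's a sketch, again using sympy, of solving the system "both partial derivatives equal 0" for this particular $f$:

import sympy as sp

x, y = sp.symbols("x y")
f = (x**2 + y**2) / 9

# Solve the system of equations: df/dx = 0 and df/dy = 0.
sp.solve([sp.diff(f, x), sp.diff(f, y)], [x, y])
# Expect the single critical point x = 0, y = 0.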

Interactive figure produced in Jupyter: the paraboloid $f(x, y)$ with its critical point marked at the origin.

There is a second derivative test for functions of multiple variables, but it’s a bit more complicated than the single variable case, and to give you an honest explanation of it, I’ll need to introduce you to quite a bit of linear algebra first. So, we’ll table that thought for now.

The function $g(x, y) = x^3 - 3xy^2 + 2 \sin(x) \cos(y)$ has much more complicated partial derivatives, and so it's difficult to solve for its critical points by hand. Fear not – in Chapter 4, when we discover the technique of gradient descent, we'll learn how to minimize such functions just by using their partial derivatives, even when we can't solve for where they're 0.


Minimizing Mean Squared Error

Finding the Partial Derivatives

Let’s return to the simple linear regression problem. Recall, the function we’re trying to minimize is:

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

Why? By minimizing $R_\text{sq}(w_0, w_1)$, we're finding the intercept ($w_0^*$) and slope ($w_1^*$) of the line that best fits the data. Don't forget that this goal is the point of all of these mathematical ideas.

Interactive figure produced in Jupyter: the loss surface $R_\text{sq}(w_0, w_1)$, with its minimizer marked in gold.

We’ve learned that to minimize $R_\text{sq}(w_0, w_1)$, we'll need to find both of its partial derivatives, and solve for the point $(w_0^*, w_1^*, R_\text{sq}(w_0^*, w_1^*))$ at which they're both 0.

Let’s start with the partial derivative with respect to $w_0$:

$$
\begin{align*}
R_\text{sq}(w_0, w_1) &= \frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2 \\
\frac{\partial R_{\text{sq}}}{\partial w_0} &= \frac{\partial}{\partial w_0} \left[ \frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2 \right] \\
&= \frac{1}{n} \sum_{i = 1}^n \frac{\partial}{\partial w_0} \left( y_i - (w_0 + w_1 x_i) \right)^2 \\
&= \frac{1}{n} \sum_{i = 1}^n 2\left( y_i - (w_0 + w_1 x_i) \right) \cdot \underbrace{\frac{\partial}{\partial w_0} \left( y_i - (w_0 + w_1 x_i) \right)}_\text{chain rule} \\
&= \frac{1}{n} \sum_{i = 1}^n 2\left( y_i - (w_0 + w_1 x_i) \right) \cdot (-1) \\
&= -\frac{2}{n} \sum_{i = 1}^n \left( y_i - (w_0 + w_1 x_i) \right)
\end{align*}
$$

Onto $w_1$:

$$
\begin{aligned}
R_\text{sq}(w_0, w_1) &= \frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2 \\
\frac{\partial R_{\text{sq}}}{\partial w_1} &= \frac{\partial}{\partial w_1} \left[ \frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2 \right] \\
&= \frac{1}{n} \sum_{i = 1}^n \frac{\partial}{\partial w_1} \left( y_i - (w_0 + w_1 x_i) \right)^2 \\
&= \frac{1}{n} \sum_{i = 1}^n 2\left( y_i - (w_0 + w_1 x_i) \right) \cdot \underbrace{\frac{\partial}{\partial w_1} \left( y_i - (w_0 + w_1 x_i) \right)}_\text{chain rule} \\
&= \frac{1}{n} \sum_{i = 1}^n 2\left( y_i - (w_0 + w_1 x_i) \right) \cdot (-x_i) \\
&= -\frac{2}{n} \sum_{i = 1}^n x_i \left( y_i - (w_0 + w_1 x_i) \right)
\end{aligned}
$$

All in one place now:

$$
\begin{aligned}
&\frac{\partial R_{\text{sq}}}{\partial w_0} = -\frac{2}{n} \sum_{i = 1}^n \left( y_i - (w_0 + w_1 x_i) \right) \\
&\frac{\partial R_{\text{sq}}}{\partial w_1} = -\frac{2}{n} \sum_{i = 1}^n x_i \left( y_i - (w_0 + w_1 x_i) \right)
\end{aligned}
$$

These look very similar – it's just that $\frac{\partial R_{\text{sq}}}{\partial w_1}$ has an added $x_i$ in the summation.

Remember, both partial derivatives are functions of two variables: $w_0$ and $w_1$. We're treating the $x_i$'s and $y_i$'s as constants. Once we have a dataset, you can pick an intercept $w_0$ and slope $w_1$, and I can use these formulas to compute the partial derivatives of $R_\text{sq}$ for that combination of intercept and slope.

In case it helps you put things in perspective, here’s how I might implement these formulas in code, assuming that x and y are arrays:

import numpy as np

# Assume x and y are defined as NumPy arrays somewhere above these functions.
def partial_R_w0(w0, w1):
    # Sub-optimal technique, since it uses a for-loop.
    total = 0
    for i in range(len(x)):
        total += (y[i] - (w0 + w1 * x[i]))
    return -2 * total / len(x)
    # Returns a single number!

def partial_R_w1(w0, w1):
    # Better technique, as it uses vectorized operations.
    return -2 * np.mean(x * (y - (w0 + w1 * x)))
    # Also returns a single number!
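
One way to gain confidence in the algebra above is to compare these functions against numerical approximations of the partial derivatives. Here's a sketch of a central-difference check, again assuming x and y are defined as arrays (the specific inputs 100 and -5 are arbitrary):

def R_sq(w0, w1):
    # Mean squared error of the line with intercept w0 and slope w1.
    return np.mean((y - (w0 + w1 * x)) ** 2)

def finite_difference_check(w0, w1, h=1e-6):
    # Central-difference approximations of both partial derivatives.
    approx_w0 = (R_sq(w0 + h, w1) - R_sq(w0 - h, w1)) / (2 * h)
    approx_w1 = (R_sq(w0, w1 + h) - R_sq(w0, w1 - h)) / (2 * h)
    return approx_w0 - partial_R_w0(w0, w1), approx_w1 - partial_R_w1(w0, w1)

# Both differences should be very close to 0, for any choice of w0 and w1.
finite_difference_check(100, -5)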

Before we solve for where both $\frac{\partial R_{\text{sq}}}{\partial w_0}$ and $\frac{\partial R_{\text{sq}}}{\partial w_1}$ are 0, let's visualize them in the context of our loss surface.

Interactive figure produced in Jupyter: the loss surface, with sliders for fixing $w_0$ or $w_1$.

Click “Slider for values of $w_0$”. No matter where you drag that slider, the resulting gold curve is a function of $w_1$ only. Every gold curve you see when dragging the $w_0$ slider will have a minimum at some value of $w_1$.

Then, click “Slider for values of $w_1$”. No matter where you drag that slider, the resulting gold curve is a function of $w_0$ only, and has some minimum value.

But there is only one combination of $w_0$ and $w_1$ at which both gold curves achieve their minimums at the same intersecting point. That is the combination of $w_0$ and $w_1$ that minimizes $R_\text{sq}$, and it's the one we're searching for.

Solving for the Optimal Parameters

Now, it’s time to analytically (that is, on paper) find the values of $w_0^*$ and $w_1^*$ that minimize $R_\text{sq}$. We'll do so by solving the following system of two equations in two unknowns:

$$
\begin{aligned}
\frac{\partial R_\text{sq}}{\partial w_0} &= -\frac{2}{n} \sum_{i = 1}^n \left( y_i - (w_0 + w_1 x_i) \right) = 0 \\
\frac{\partial R_\text{sq}}{\partial w_1} &= -\frac{2}{n} \sum_{i = 1}^n x_i \left( y_i - (w_0 + w_1 x_i) \right) = 0
\end{aligned}
$$

Here’s my plan:

  1. In the first equation, try and isolate $w_0$; this value will be called $w_0^*$.

  2. Plug the expression for $w_0^*$ into the second equation to solve for $w_1^*$.

Let’s start with the first step.

$$-\frac{2}{n} \sum_{i = 1}^n \left( y_i - (w_0 + w_1 x_i) \right) = 0$$

Multiplying both sides by $-\frac{n}{2}$ gives us:

$$\sum_{i = 1}^n \left( \underbrace{y_i}_\text{actual} - \underbrace{w_0 + w_1 x_i}_\text{predicted} \right) = 0$$

Before I continue, I want to highlight that this itself is an important balance condition, much like those we discussed in Chapter 1.3. It's saying that the sum of the errors of the optimal line's predictions – that is, the line with intercept $w_0^*$ and slope $w_1^*$ – is 0.

Let’s continue with the first step – I’ll try and keep the commentary to a minimum. It’s important to try and replicate these steps yourself, on paper.

$$
\begin{aligned}
\sum_{i = 1}^n \left( y_i - (w_0 + w_1 x_i) \right) &= 0 \\
\sum_{i = 1}^n \left( y_i - w_0 - w_1 x_i \right) &= 0 \\
\sum_{i = 1}^n y_i - \sum_{i = 1}^n w_0 - \sum_{i = 1}^n w_1 x_i &= 0 \\
\sum_{i = 1}^n y_i - n w_0 - w_1 \sum_{i = 1}^n x_i &= 0 \\
\sum_{i = 1}^n y_i - w_1 \sum_{i = 1}^n x_i &= n w_0 \\
\frac{\sum_{i = 1}^n y_i}{n} - w_1 \frac{\sum_{i = 1}^n x_i}{n} &= w_0 \\
w_0^* &= \bar{y} - w_1^* \bar{x}
\end{aligned}
$$

Awesome! We’re halfway there. We have a formula for the optimal intercept, $w_0^*$, in terms of the optimal slope, $w_1^*$. Let's use $w_0^* = \bar{y} - w_1^* \bar{x}$ and see where it gets us in the second equation.

$$
\begin{aligned}
-\frac{2}{n} \sum_{i = 1}^n x_i \left( y_i - (w_0 + w_1 x_i) \right) &= 0 \\
\sum_{i = 1}^n x_i \left( y_i - (w_0 + w_1 x_i) \right) &= 0 \\
\sum_{i = 1}^n x_i \left( y_i - (\underbrace{\bar{y} - w_1^* \bar{x}}_{w_0^*} + w_1^* x_i) \right) &= 0 \\
\sum_{i = 1}^n x_i ( \underbrace{y_i - \bar{y} + w_1^* \bar{x} - w_1^* x_i}_\text{distribute negation} ) &= 0 \\
\sum_{i = 1}^n x_i \left( (y_i - \bar{y}) - w_1^* (x_i - \bar{x}) \right) &= 0 \\
\underbrace{\sum_{i = 1}^n x_i (y_i - \bar{y}) - w_1^* \sum_{i=1}^n x_i (x_i - \bar{x})}_\text{expand summation} &= 0 \\
\sum_{i = 1}^n x_i (y_i - \bar{y}) &= w_1^* \sum_{i=1}^n x_i (x_i - \bar{x}) \\
w_1^* &= \frac{\sum_{i = 1}^n x_i (y_i - \bar{y})}{\sum_{i=1}^n x_i (x_i - \bar{x})}
\end{aligned}
$$

Rewriting and Using the Formulas

We’re done! We have formulas for the optimal slope and intercept. But, before we celebrate, I'm going to try and rewrite $w_1^*$ in an equivalent, more symmetrical form that is easier to interpret.

Claim:

$$w_1^* = \underbrace{\frac{\sum_{i = 1}^n x_i (y_i - \bar{y})}{\sum_{i=1}^n x_i (x_i - \bar{x})}}_\text{formula we derived above} = \underbrace{\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}}_\text{nicer looking formula}$$

This is not the only other equivalent formula for the slope; for instance, $w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})y_i}{\sum_{i=1}^n (x_i - \bar{x})^2}$ too, and you can verify this using the same logic as in the proof above.

To summarize, the parameters that minimize mean squared error for the simple linear regression model, $h(x_i) = w_0 + w_1 x_i$, are:

$$\boxed{w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad w_0^* = \bar{y} - w_1^* \bar{x}}$$

This is an important result, and you should remember it. There are a lot of symbols above, but just note that given a dataset $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, you could apply the formulas above by hand to find the optimal parameters yourself.

What does this line look like on the commute times data?

Image produced in Jupyter

The line above goes by many names:

  • The simple linear regression line that minimizes mean squared error.

  • The simple linear regression line (if said without context).

  • The regression line.

  • The least squares regression line (because it has the least mean squared error).

  • The line of best fit.

Whatever you’d like to call it, now that we’ve found our optimal parameters, we can use them to make predictions.

$$h(x_i) = w_0^* + w_1^* x_i$$

On the dataset of commute times:

# Assume x is an array with departure hours and y is an array with commute times.
w1_star = np.sum((x - np.mean(x)) * (y - np.mean(y))) / np.sum((x - np.mean(x)) ** 2)
w0_star = np.mean(y) - w1_star * np.mean(x)

w0_star, w1_star
(142.4482415877287, -8.186941724265552)

So, our specific fit – or trained – hypothesis function is:

$$
\begin{align*}
\text{predicted commute time}_i &= h(\text{departure hour}_i) \\
&= 142.45 - 8.19 \cdot \text{departure hour}_i
\end{align*}
$$

This trained hypothesis function is not saying that leaving later causes you to have shorter commutes. Rather, that’s just the best linear pattern it observed in the data for the purposes of minimizing mean squared error. In reality, there are other factors that affect commute times, and we haven’t performed a thorough-enough analysis to say anything about the causal relationship between departure time and commute time.

To predict how long it might take to get to school tomorrow, plug in the time you'd like to leave for $\text{departure hour}_i$ and out will come your prediction. The slope, -8.19, is in units of $\frac{\text{units of } y}{\text{units of } x} = \frac{\text{minutes}}{\text{hour}}$, and is telling us that for every hour later you leave, your predicted commute time decreases by 8.19 minutes.

In Python, I can define a predict function as follows:

def predict(x_new):
    return w0_star + w1_star * x_new

# Predicted commute time if I leave at 8:30AM.
predict(8.5)
72.8592369314715
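
As a quick sanity check tied to the balance condition we derived earlier – that the optimal line's errors sum to 0 – we can add up the residuals of the fitted line, using the x, y, w0_star, and w1_star defined above:

residuals = y - (w0_star + w1_star * x)
np.sum(residuals)  # Should be 0, up to floating point rounding.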

Regression Line Passes Through the Mean

There’s an important property that the regression line satisfies: for any dataset, the line that minimizes mean squared error passes through the point $(\text{mean of } x, \text{mean of } y)$.

Image produced in Jupyter
predict(np.mean(x))
73.18461538461538
# Same!
np.mean(y)
73.18461538461538

Our commute times regression line passes through the point $(\bar{x}, \bar{y})$, even though that point was not necessarily one of the original points in the dataset.

Intuitively, this says that for an average input, the line that minimizes mean squared error will always predict an average output.

Why is this fact true? See if you can reason about it yourself, then check the solution once you’ve attempted it.

The Modeling Recipe

To conclude, let’s run through the three-step modeling recipe.

1. Choose a model.

$$h(x_i) = w_0 + w_1 x_i$$

2. Choose a loss function.

We chose squared loss:

$$L_\text{sq}(y_i, h(x_i)) = (y_i - h(x_i))^2$$

3. Minimize average loss to find optimal parameters.

For the simple linear regression model, empirical risk is:

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

We showed that:

$$w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad w_0^* = \bar{y} - w_1^* \bar{x}$$

While the process of minimizing $R_\text{sq}$ was much, much more complex than in the case of our single parameter model, the conceptual backing of the process was still this three-step recipe, and hopefully now you see its value.


Correlation

Sometimes, we’re not necessarily interested in making predictions, but instead want to be descriptive about patterns that exist in data.

In a scatter plot of two variables, if there is any pattern, we say the variables are associated. If the pattern resembles a straight line, we say the variables are correlated, i.e. linearly associated. We can measure how much a scatter plot resembles a straight line using the correlation coefficient.

Interpreting the Correlation Coefficient

There are actually many different correlation coefficients; the one we'll use, defined below, is the most common, and it's sometimes called the Pearson correlation coefficient, after the British statistician Karl Pearson.

No matter the values of $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$, the value of $r$ is bounded between -1 and 1. The closer $|r|$ is to 1, the stronger the linear association. The sign of $r$ tells us the direction of the trend – upwards (positive) or downwards (negative). $r$ is a unitless quantity – it's not measured in hours, or dollars, or minutes, or anything else that depends on the units of $x$ and $y$.

Image produced in Jupyter

The plots above give us some examples of what the correlation coefficient can look like in practice.

  • Top left ($r = 0.046$): There's some loose circle-like pattern, but it mostly looks like a random cloud of points. $|r|$ is close to 0, but just happens to be positive.

  • Top right ($r = -0.993$): The points are very tightly clustered around a line with a negative slope, so $r$ is close to -1.

  • Bottom left ($r = -0.031$): While the points are certainly associated, they are not linearly associated, so the value of $r$ is close to 0. (The shape looks more like a V or parabola than a straight line.)

  • Bottom right ($r = 0.607$): The points are loosely clustered and follow a roughly linear pattern trending upwards. $r$ is positive, but not particularly large.

The correlation coefficient has some useful properties to be aware of. For one, it's symmetric: $r(x, y) = r(y, x)$. If you swap the $x_i$'s and $y_i$'s in its formula, you'll see the result is the same.

$$r = \frac{1}{n} \sum_{i = 1}^n \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right)$$

One way to think of $r$ is that it's the mean of the product of $x$ and $y$, once both variables have been standardized. To standardize a collection of numbers $x_1, x_2, \ldots, x_n$, you first find the mean $\bar{x}$ and standard deviation $\sigma_x$ of the collection. Then, for each $x_i$, you compute:

$$z_i = \frac{x_i - \bar{x}}{\sigma_x}$$

This tells you how many standard deviations away from the mean each $x_i$ is. For example, if $z_i = -1.5$, that means $x_i$ is 1.5 standard deviations below the mean of $x$. The value of $x_i$ once it's standardized is sometimes called its $z$-score; you may have heard of $z$-scores in the context of curved exam scores.

With this in mind, I'll again state that $r$ is the mean of the product of $x$ and $y$, once both variables have been standardized:

$$r = \frac{1}{n} \sum_{i = 1}^n \underbrace{\left( \frac{x_i - \bar{x}}{\sigma_x} \right)}_{x_i\text{'s } z\text{-score}} \times \underbrace{\left( \frac{y_i - \bar{y}}{\sigma_y} \right)}_{y_i\text{'s } z\text{-score}}$$

This interpretation of $r$ makes it a bit easier to see why $r$ measures the strength of linear association – because up until now, it must seem like a formula I pulled out of thin air.

If there's positive linear association, then $x_i$ and $y_i$ will usually either both be above their averages, or both be below their averages, meaning that $x_i - \bar{x}$ and $y_i - \bar{y}$ will usually have the same sign. If we multiply two numbers with the same sign – either both positive or both negative – then the product will be positive.

Image produced in Jupyter

Since most points are in the bottom-left and top-right quadrants, most of the products $(x_i - \bar{x})(y_i - \bar{y})$ are positive. This means that $r$, which is the average of these products divided by the standard deviations of $x$ and $y$, will be positive too. (We divide by the standard deviations to ensure that $-1 \leq r \leq 1$.)

Above, $r$ is positive but not exactly 1, since there are several points in the bottom-right and top-left quadrants, which would have a negative product $(x_i - \bar{x})(y_i - \bar{y})$ and bring down the average product.

If there's negative linear association, then typically it'll be the case that $x_i$ is above average while $y_i$ is below average, or vice versa. This means that $x_i - \bar{x}$ and $y_i - \bar{y}$ will usually have opposite signs, and when they have opposite signs, their product will be negative. If most points have a negative product, then $r$ will be negative too.
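
Here's a short sketch of this $z$-score interpretation in code, assuming x and y are the commute-times arrays from earlier; np.corrcoef is used only as an independent check of our formula:

# Standardize both variables, then average the products of their z-scores.
z_x = (x - np.mean(x)) / np.std(x)
z_y = (y - np.mean(y)) / np.std(y)
r = np.mean(z_x * z_y)

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
r, np.corrcoef(x, y)[0, 1]  # The two values should match.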

Preserving Correlation

Since $r$ measures how closely points cluster around a line, it is invariant to changes in units of measurement – that is, to linear transformations applied to each variable independently (up to a possible flip in sign, as we'll see below).

Image produced in Jupyter

The top left scatter plot is the same as in the previous example, where we reasoned about why $r$ is positive. The other three plots result from applying linear transformations to the $x$ and/or $y$ variables independently. A linear transformation of $x$ is any function of the form $ax + b$, and a linear transformation of $y$ is any function of the form $cy + d$. (This is an idea we'll revisit more in Chapter 2.)

Notice that three of the four plots have the same $r$ of approximately 0.79. The bottom right plot has an $r$ of approximately -0.79, because the $y$ coordinates were multiplied by a negative constant. What we're seeing is that the correlation coefficient is invariant to linear transformations of the two variables independently, with only its sign flipping when one of the multiplying constants is negative.

Put in real-world terms: it doesn't matter if you measure commute times in hours, minutes, or seconds – the correlation between departure time and commute time will be the same in all three cases.
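
Here's a minimal sketch of this invariance on synthetic data (the particular constants and the random seed below are arbitrary):

rng = np.random.default_rng(42)
x_demo = rng.normal(size=100)
y_demo = 2 * x_demo + rng.normal(size=100)

r_original = np.corrcoef(x_demo, y_demo)[0, 1]
r_rescaled = np.corrcoef(60 * x_demo + 7, 0.5 * y_demo - 3)[0, 1]  # Positive scalings.
r_flipped = np.corrcoef(x_demo, -4 * y_demo)[0, 1]                 # Negative scaling of y.

# The first two should match; the third should have the opposite sign.
r_original, r_rescaled, r_flipped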

Correlation and the Regression Line

Since $r$ measures how closely points cluster around a line, it shouldn't be all that surprising that $r$ has something to do with $w_1^*$, the slope of the regression line.

It turns out that:

$$w_1^* = \underbrace{\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}}_\text{from earlier} = \boxed{r \frac{\sigma_y}{\sigma_x}}$$

This is my preferred version of the formula for the optimal slope – it's easy to use and interpret. I've hidden the proof behind a dropdown menu below, but you really should attempt it on your own (and then read it), since it helps build familiarity with how the various components of the formula for $r$ and $w_1^*$ are related.

The simpler formula above implies that the sign of the slope is the same as the sign of $r$, which seems reasonable: if the direction of the linear association is negative, the best-fitting slope should be, too.

So, all in one place:

$$w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = r \frac{\sigma_y}{\sigma_x}, \qquad w_0^* = \bar{y} - w_1^* \bar{x}$$
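
To tie this back to the commute times data, here's a one-line check – assuming x, y, and w1_star are still defined as in the earlier code – that the two formulas for the slope agree:

r = np.corrcoef(x, y)[0, 1]
r * np.std(y) / np.std(x), w1_star  # Both should be approximately -8.19.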

This new formula for the slope also gives us insight into how the spread of $x$ ($\sigma_x$) and $y$ ($\sigma_y$) affects the slope. If $y$ is more spread out than $x$, the points on the scatter plot will be stretched out vertically, which will make the best-fitting slope steeper.

Image produced in Jupyter

In the middle example above, $y_i \rightarrow 2y_i$ means that we replaced each $y_i$ in the dataset with $2y_i$. In that example, the slope and intercept of the regression line both doubled. In the third example, where we replaced each $x_i$ with $3x_i$, the slope was divided by 3, while the intercept remained the same. One of the problems in Homework 2 has you prove these sorts of results, and you can do so by relying on the formula for $w_1^*$ that involves $r$; note that all three datasets above have the same $r$.

Example: Anscombe’s Quartet

The correlation coefficient is just one number that describes the linear association between two variables; it doesn’t tell us everything.

Consider the famous example of Anscombe’s quartet, which consists of four datasets that all have the same mean, standard deviation, and correlation coefficient, but look very different.

Image produced in Jupyter

In all four datasets:

$$\bar{x} = 9, \quad \bar{y} = 7.5, \quad \sigma_x = 3.16, \quad \sigma_y = 1.94, \quad r = 0.82$$

Because they all share the same values of these five key quantities, they also share the same regression line, since the optimal slope and intercept are determined using just those five quantities.

$$w_1^* = r \frac{\sigma_y}{\sigma_x} = 0.82 \cdot \frac{1.94}{3.16} \approx 0.50 \qquad w_0^* = \bar{y} - w_1^* \bar{x} = 7.5 - 0.50 \cdot 9 = 3.0$$
Image produced in Jupyter

The regression line clearly looks better for some datasets than others, with Dataset IV looking particularly off. A high $|r|$ may be evidence of a strong linear association, but it cannot guarantee that a linear model is suitable for a dataset. Moral of the story – visualize your data before trying to fit a model! Don't just trust the numbers.
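
If you'd like to verify the shared regression line yourself, here's a sketch using the copy of Anscombe's quartet that ships with the seaborn library (this assumes seaborn is installed and can fetch its example datasets):

import numpy as np
import seaborn as sns

anscombe = sns.load_dataset("anscombe")  # Columns: dataset, x, y.
for name, group in anscombe.groupby("dataset"):
    x_a, y_a = group["x"].to_numpy(), group["y"].to_numpy()
    w1 = np.sum((x_a - x_a.mean()) * (y_a - y_a.mean())) / np.sum((x_a - x_a.mean()) ** 2)
    w0 = y_a.mean() - w1 * x_a.mean()
    print(name, round(w0, 2), round(w1, 2))
# Each of the four datasets should print (approximately) the same intercept and slope.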

You might like the Datasaurus Dozen, another similar collection of 13 datasets that all have the same mean, standard deviation, and correlation coefficient, but look very different. (One looks like a dinosaur!)