
1.4. Simple Linear Regression

Introduction

The time has finally come: let’s apply what we’ve learned about loss functions and the modeling recipe to “upgrade” from the constant model to the simple linear regression model.

To recap, our goal is to find a hypothesis function $h$ such that:

$$\text{predicted commute time}_i = h(\text{departure hour}_i)$$
Image produced in Jupyter

So far, we’ve studied the constant model, where the hypothesis function is a horizontal line:

$$h(x_i) = w$$

The sole parameter, $w$, controlled the height of the line. Up until now, "parameter" and "prediction" were interchangeable terms, because our sole parameter $w$ controlled what our constant prediction was.

Now, the simple linear regression model has two parameters:

$$h(x_i) = w_0 + w_1 x_i$$

$w_0$ controls the intercept of the line, and $w_1$ controls its slope. No longer is it the case that "parameter" and "prediction" are interchangeable terms, because $w_0$ and $w_1$ control different aspects of the prediction-making process.

How do we find the optimal parameters, $w_0^*$ and $w_1^*$? Different values of $w_0$ and $w_1$ give us different lines, each of which fits the data with varying degrees of accuracy.

Image produced in Jupyter

To make things precise, let’s turn to the three-step modeling recipe from Chapter 1.3.

1. Choose a model.

$$h(x_i) = w_0 + w_1 x_i$$

2. Choose a loss function.

We’ll stick with squared loss:

$$L_\text{sq}(y_i, h(x_i)) = (y_i - h(x_i))^2$$

3. Minimize average loss to find optimal parameters.

Mean squared error, for any hypothesis function $h$, takes the form:

$$\frac{1}{n} \sum_{i=1}^n (y_i - h(x_i))^2$$

For the simple linear regression model, this becomes:

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

Now, we need to find the values of $w_0$ and $w_1$ that together minimize $R_\text{sq}(w_0, w_1)$. But what does that even mean?

In the case of the constant model and squared loss, where we had to minimize $R_\text{sq}(w) = \frac{1}{n} \sum_{i=1}^n (y_i - w)^2$, we did so by taking the derivative with respect to $w$ and setting it to 0.

Image produced in Jupyter

$R_\text{sq}(w)$ was a function with just a single input variable ($w$), so the problem of minimizing $R_\text{sq}(w)$ was straightforward, and resembled problems we solved in Calculus 1.

The function $R_\text{sq}(w_0, w_1)$ we're minimizing now has two input variables, $w_0$ and $w_1$. In mathematics, sometimes we'll write $R_\text{sq}: \mathbb{R}^2 \to \mathbb{R}$ to say that $R_\text{sq}$ is a function that takes in two real numbers and returns a single real number.

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

Remember, we should treat the $x_i$'s and $y_i$'s as constants, as these are known quantities once we're given a dataset.

What does $R_\text{sq}(w_0, w_1)$ even look like? We need three dimensions to visualize it – one axis for $w_0$, one for $w_1$, and one for the output, $R_\text{sq}(w_0, w_1)$.

Interactive figure produced in Jupyter: the loss surface of $R_\text{sq}(w_0, w_1)$ for the commute times dataset.

The graph above is called a loss surface, even though it’s a graph of empirical risk, i.e. average loss, not the loss for a single data point. The plot is interactive, so you should drag it around to get a sense of what it looks like. It looks like a parabola with added depth, similar to how cubes look like squares with added depth. Lighter regions above correspond to low mean squared error, and darker regions correspond to high mean squared error.

Think of the “floor” of the graph – in other words, the $w_0$-$w_1$ plane – as the set of all possible combinations of intercept and slope. The height of the surface at any point $(w_0, w_1)$ is the mean squared error of the hypothesis $h(x_i) = w_0 + w_1 x_i$ on the commute times dataset.

Our goal is to find the combination of $w_0$ and $w_1$ that gets us to the bottom of the surface, marked by the gold point in the plot. Somehow, this will involve calculus and derivatives, but we'll need to extend our single variable approach.
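
If you'd like to get a computational feel for this surface before we bring in any calculus, here's a minimal sketch that evaluates mean squared error over a grid of candidate intercepts and slopes, assuming x and y are the arrays of departure hours and commute times (the grid ranges below are made up for illustration):

import numpy as np

def mse(w0, w1):
    # Mean squared error of the line with intercept w0 and slope w1.
    return np.mean((y - (w0 + w1 * x)) ** 2)

# Candidate intercepts and slopes to try (ranges chosen arbitrarily).
w0_grid = np.linspace(50, 250, 200)
w1_grid = np.linspace(-20, 5, 200)
surface = np.array([[mse(w0, w1) for w1 in w1_grid] for w0 in w0_grid])

# The grid cell with the smallest MSE approximates the bottom of the surface.
i, j = np.unravel_index(np.argmin(surface), surface.shape)
w0_grid[i], w1_grid[j]

A fine enough grid gets close to the gold point, but it's a brute-force approach – the rest of this section develops the exact answer.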


Functions of Multiple Variables

Partial Derivatives

How do we take the derivative of a function with multiple input variables?

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

To illustrate, let’s focus on a simpler function with two input variables:

$$f(x,y) = \frac{x^2 + y^2}{9}$$

This is a quadratic function of two variables, and its graph is known as a paraboloid.

Interactive figure produced in Jupyter: the paraboloid $f(x, y) = \frac{x^2 + y^2}{9}$.

In the single-input case – i.e., for functions of the form $f: \mathbb{R} \to \mathbb{R}$ – the derivative $\frac{\text{d}}{\text{d}x}f(x)$ captured $f(x)$'s rate of change along the $x$-axis, which was the only axis of motion.

The function $f(x, y)$ has two input variables, and so there are two directions along which we can move. As such, we need two "derivatives" to describe the rate of change of $f(x, y)$ – one for the $x$-axis and one for the $y$-axis. Think of this as a science experiment, where we need control variables to isolate changes to a single variable. Our solution to this dilemma comes in the form of partial derivatives.

If $f$ has $n$ input variables, it has $n$ partial derivatives, one for each axis. The function $f(x, y) = \frac{x^2 + y^2}{9}$ has two partial derivatives, $\frac{\partial f}{\partial x}(x, y)$ and $\frac{\partial f}{\partial y}(x, y)$. (The symbol you're seeing, $\partial$, is a stylized, curly letter "d", often read aloud as "partial", and is used specifically for partial derivatives.)

Let me show you how to compute partial derivatives before we visualize them. We'll start with $\frac{\partial f}{\partial x}(x, y)$.

$$
\begin{align*}
f(x,y) &= \frac{x^2 + y^2}{9} \\
\frac{\partial f}{\partial x}(x, y) &= \frac{\partial}{\partial x}\left(\frac{x^2+y^2}{9}\right) \\
&= \frac{1}{9}\,\frac{\partial}{\partial x}(x^2+y^2) \\
&= \frac{1}{9}\left(\frac{\partial}{\partial x}x^2 + \underbrace{\frac{\partial}{\partial x}y^2}_{=0}\right) \\
&= \frac{1}{9}\,(2x+0) \\
&= \frac{2x}{9}
\end{align*}
$$

The result, $\frac{\partial f}{\partial x}(x, y) = \frac{2x}{9}$, is a function of $x$ and $y$. It tells us the rate of change of $f(x,y)$ along the $x$-axis, at any point $(x, y)$. It just so happens that this function doesn't involve $y$ since we chose a relatively simple function $f$, but we'll see more sophisticated examples soon.

Following similar steps, you'll see that $\frac{\partial f}{\partial y}(x, y) = \frac{2y}{9}$. This gives us:

$$\frac{\partial f}{\partial x}(x, y) = \frac{2x}{9}, \quad \frac{\partial f}{\partial y}(x, y) = \frac{2y}{9}$$

Let’s pick an arbitrary point and see what the partial derivatives tell us about it. Consider, say, $(-3, 0.5)$:

  • $\frac{\partial f}{\partial x}(-3, 0.5) = \frac{2(-3)}{9} = -\frac{2}{3}$, so if we hold $y$ constant, $f$ decreases as $x$ increases.

  • $\frac{\partial f}{\partial y}(-3, 0.5) = \frac{2(0.5)}{9} = \frac{1}{9}$, so if we hold $x$ constant, $f$ increases as $y$ increases.

Interactive figure produced in Jupyter: tangent lines to $f$ in the $x$ and $y$ directions at the point $(-3, 0.5)$.

Above, we’ve shown the tangent lines in both the $x$ and $y$ directions at the point $(-3, 0.5)$. After all, the derivative of a function at a point tells us the slope of the tangent line at that point; that interpretation remains true with partial derivatives.
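
If you'd like to double-check these partial derivatives and the numbers above symbolically, here's a quick sketch using the sympy library (which isn't used elsewhere in this chapter, so treat this as an optional aside):

import sympy as sp

x, y = sp.symbols("x y")  # Standalone symbols; not the data arrays from earlier.
f = (x**2 + y**2) / 9

df_dx = sp.diff(f, x)  # Should be 2*x/9.
df_dy = sp.diff(f, y)  # Should be 2*y/9.

# Evaluate both partial derivatives at the point (-3, 0.5).
df_dx.subs({x: -3, y: sp.Rational(1, 2)}), df_dy.subs({x: -3, y: sp.Rational(1, 2)})
# Expect -2/3 and 1/9, matching the bullets above.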

Let’s look at a more complex example. Consider:

$$g(x, y) = x^3 - 3xy^2 + 2 \sin(x) \cos(y)$$
Interactive figure produced in Jupyter: the surface $g(x, y)$.

Both partial derivatives are functions of both $x$ and $y$, which is typically what we'll see.

$$
\begin{align*}
g(x, y) &= x^3 - 3xy^2 + 2 \sin(x) \cos(y) \\
\frac{\partial g}{\partial x}(x, y) &= 3x^2 - 3y^2 + 2 \cos(x) \cos(y) \\
\frac{\partial g}{\partial y}(x, y) &= -6xy - 2 \sin(x) \sin(y)
\end{align*}
$$

To compute $\frac{\partial g}{\partial x}(x, y)$, we treated $y$ as a constant. Let me try and make more sense of this.

To help visualize, we’ve drawn the function $g(x, y)$, along with the plane $y = a$. The slider lets you change the value of $a$ being considered, i.e., it lets you change the constant value that we’re assigning to $y$.

The intersection of $g(x, y)$ and $y = a$ is marked as a gold curve and is a function of $x$ alone.

Interactive figure produced in Jupyter: $g(x, y)$ intersected with the plane $y = a$, with a slider for $a$.

Drag the slider to $y = 1.40$, for example, and look at the gold curve that results. The expression below tells you the derivative of that gold curve with respect to $x$.

$$\frac{\partial g}{\partial x}(x, 1.40) = 3x^2 - 3(1.40)^2 + 2 \cos(x) \cos(1.40) = \underbrace{3x^2 + 0.34 \cos(x) - 5.88}_{\text{derivative of the gold curve w.r.t. } x}$$

Thinking in three dimensions can be difficult, so don’t fret if you’re confused as to what all of these symbols mean – this is all a bit confusing to me too. (Are professors allowed to say this?) Nonetheless, I hope these interactive visualizations are helping you make some sense of the formulas, and if there’s anything I can do to make them clearer, please do tell me!

Optimization

To minimize (or maximize) a function $f: \mathbb{R} \to \mathbb{R}$, we solved for critical points, which were points where the (single variable) derivative was 0, and used the second derivative test to classify them as minima or maxima (or neither, as in the case of $f(x) = x^3$ at $x = 0$).

The analog in the $\mathbb{R}^2 \rightarrow \mathbb{R}$ case is solving for the points where both partial derivatives are 0, which corresponds to the points where the function is neither increasing nor decreasing along either axis.

In the case of our first example,

$$f(x, y) = \frac{x^2 + y^2}{9}$$

the partial derivatives were relatively simple,

$$\frac{\partial f}{\partial x} = \frac{2x}{9}, \quad \frac{\partial f}{\partial y} = \frac{2y}{9}$$

and both are 0 when $x = y = 0$. So, $(0, 0, f(0, 0))$ is a critical point, and we can see visually that it's a global minimum.

(Notice that above I wrote $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ instead of $\frac{\partial f}{\partial x}(x, y)$ and $\frac{\partial f}{\partial y}(x, y)$ to save space, but don't forget that both partial derivatives are functions of both $x$ and $y$ in general.)
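
To connect this to code: here's a sketch, again using sympy, of solving the system "both partial derivatives equal 0" for this particular $f$:

import sympy as sp

x, y = sp.symbols("x y")
f = (x**2 + y**2) / 9

# Solve the system of equations: df/dx = 0 and df/dy = 0.
sp.solve([sp.diff(f, x), sp.diff(f, y)], [x, y])
# Expect the single critical point x = 0, y = 0.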

Interactive figure produced in Jupyter: the paraboloid $f(x, y)$ with its critical point marked at the origin.

There is a second derivative test for functions of multiple variables, but it’s a bit more complicated than the single variable case, and to give you an honest explanation of it, I’ll need to introduce you to quite a bit of linear algebra first. So, we’ll table that thought for now.

The function $g(x, y) = x^3 - 3xy^2 + 2 \sin(x) \cos(y)$ has much more complicated partial derivatives, and so it's difficult to solve for its critical points by hand. Fear not – in Chapter 4, when we discover the technique of gradient descent, we'll learn how to minimize such functions just by using their partial derivatives, even when we can't solve for where they're 0.


Minimizing Mean Squared Error

Finding the Partial Derivatives

Let’s return to the simple linear regression problem. Recall, the function we’re trying to minimize is:

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

Why? By minimizing $R_\text{sq}(w_0, w_1)$, we're finding the intercept ($w_0^*$) and slope ($w_1^*$) of the line that best fits the data. Don't forget that this goal is the point of all of these mathematical ideas.

Interactive figure produced in Jupyter: the loss surface $R_\text{sq}(w_0, w_1)$, with its minimizer marked in gold.

We’ve learned that to minimize $R_\text{sq}(w_0, w_1)$, we'll need to find both of its partial derivatives, and solve for the point $(w_0^*, w_1^*, R_\text{sq}(w_0^*, w_1^*))$ at which they're both 0.

Let’s start with the partial derivative with respect to $w_0$:

$$
\begin{align*}
R_\text{sq}(w_0, w_1) &= \frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2 \\
\frac{\partial R_{\text{sq}}}{\partial w_0} &= \frac{\partial}{\partial w_0} \left[ \frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2 \right] \\
&= \frac{1}{n} \sum_{i = 1}^n \frac{\partial}{\partial w_0} \left( y_i - (w_0 + w_1 x_i) \right)^2 \\
&= \frac{1}{n} \sum_{i = 1}^n 2\left( y_i - (w_0 + w_1 x_i) \right) \cdot \underbrace{\frac{\partial}{\partial w_0} \left( y_i - (w_0 + w_1 x_i) \right)}_\text{chain rule} \\
&= \frac{1}{n} \sum_{i = 1}^n 2\left( y_i - (w_0 + w_1 x_i) \right) \cdot (-1) \\
&= -\frac{2}{n} \sum_{i = 1}^n \left( y_i - (w_0 + w_1 x_i) \right)
\end{align*}
$$

Onto $w_1$:

$$
\begin{aligned}
R_\text{sq}(w_0, w_1) &= \frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2 \\
\frac{\partial R_{\text{sq}}}{\partial w_1} &= \frac{\partial}{\partial w_1} \left[ \frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2 \right] \\
&= \frac{1}{n} \sum_{i = 1}^n \frac{\partial}{\partial w_1} \left( y_i - (w_0 + w_1 x_i) \right)^2 \\
&= \frac{1}{n} \sum_{i = 1}^n 2\left( y_i - (w_0 + w_1 x_i) \right) \cdot \underbrace{\frac{\partial}{\partial w_1} \left( y_i - (w_0 + w_1 x_i) \right)}_\text{chain rule} \\
&= \frac{1}{n} \sum_{i = 1}^n 2\left( y_i - (w_0 + w_1 x_i) \right) \cdot (-x_i) \\
&= -\frac{2}{n} \sum_{i = 1}^n x_i \left( y_i - (w_0 + w_1 x_i) \right)
\end{aligned}
$$

All in one place now:

$$
\begin{aligned}
&\frac{\partial R_{\text{sq}}}{\partial w_0} = -\frac{2}{n} \sum_{i = 1}^n \left( y_i - (w_0 + w_1 x_i) \right) \\
&\frac{\partial R_{\text{sq}}}{\partial w_1} = -\frac{2}{n} \sum_{i = 1}^n x_i \left( y_i - (w_0 + w_1 x_i) \right)
\end{aligned}
$$

These look very similar – it's just that $\frac{\partial R_{\text{sq}}}{\partial w_1}$ has an added $x_i$ in the summation.

Remember, both partial derivatives are functions of two variables: $w_0$ and $w_1$. We're treating the $x_i$'s and $y_i$'s as constants. Once we have a dataset, you can pick an intercept $w_0$ and slope $w_1$, and I can use these formulas to compute the partial derivatives of $R_\text{sq}$ for that combination of intercept and slope.

In case it helps you put things in perspective, here’s how I might implement these formulas in code, assuming that x and y are arrays:

import numpy as np

# Assume x and y are defined as NumPy arrays somewhere above these functions.
def partial_R_w0(w0, w1):
    # Sub-optimal technique, since it uses a for-loop.
    total = 0
    for i in range(len(x)):
        total += (y[i] - (w0 + w1 * x[i]))
    return -2 * total / len(x)
    # Returns a single number!

def partial_R_w1(w0, w1):
    # Better technique, as it uses vectorized operations.
    return -2 * np.mean(x * (y - (w0 + w1 * x)))
    # Also returns a single number!
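
One way to gain confidence in the algebra above is to compare these functions against numerical approximations of the partial derivatives. Here's a sketch of a central-difference check, again assuming x and y are defined as arrays (the specific inputs 100 and -5 are arbitrary):

def R_sq(w0, w1):
    # Mean squared error of the line with intercept w0 and slope w1.
    return np.mean((y - (w0 + w1 * x)) ** 2)

def finite_difference_check(w0, w1, h=1e-6):
    # Central-difference approximations of both partial derivatives.
    approx_w0 = (R_sq(w0 + h, w1) - R_sq(w0 - h, w1)) / (2 * h)
    approx_w1 = (R_sq(w0, w1 + h) - R_sq(w0, w1 - h)) / (2 * h)
    return approx_w0 - partial_R_w0(w0, w1), approx_w1 - partial_R_w1(w0, w1)

# Both differences should be very close to 0, for any choice of w0 and w1.
finite_difference_check(100, -5)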

Before we solve for where both $\frac{\partial R_{\text{sq}}}{\partial w_0}$ and $\frac{\partial R_{\text{sq}}}{\partial w_1}$ are 0, let's visualize them in the context of our loss surface.

Interactive figure produced in Jupyter: the loss surface, with sliders for fixing $w_0$ or $w_1$.

Click “Slider for values of $w_0$”. No matter where you drag that slider, the resulting gold curve is a function of $w_1$ only. Every gold curve you see when dragging the $w_0$ slider will have a minimum at some value of $w_1$.

Then, click “Slider for values of $w_1$”. No matter where you drag that slider, the resulting gold curve is a function of $w_0$ only, and has some minimum value.

But there is only one combination of $w_0$ and $w_1$ at which both gold curves achieve their minimums at the same intersecting point. That is the combination of $w_0$ and $w_1$ that minimizes $R_\text{sq}$, and it's the one we're searching for.

Solving for the Optimal Parameters

Now, it’s time to analytically (that is, on paper) find the values of $w_0^*$ and $w_1^*$ that minimize $R_\text{sq}$. We'll do so by solving the following system of two equations in two unknowns:

$$
\begin{aligned}
\frac{\partial R_\text{sq}}{\partial w_0} &= -\frac{2}{n} \sum_{i = 1}^n \left( y_i - (w_0 + w_1 x_i) \right) = 0 \\
\frac{\partial R_\text{sq}}{\partial w_1} &= -\frac{2}{n} \sum_{i = 1}^n x_i \left( y_i - (w_0 + w_1 x_i) \right) = 0
\end{aligned}
$$

Here’s my plan:

  1. In the first equation, try and isolate $w_0$; this value will be called $w_0^*$.

  2. Plug the expression for $w_0^*$ into the second equation to solve for $w_1^*$.

Let’s start with the first step.

$$-\frac{2}{n} \sum_{i = 1}^n \left( y_i - (w_0 + w_1 x_i) \right) = 0$$

Multiplying both sides by $-\frac{n}{2}$ gives us:

$$\sum_{i = 1}^n \left( \underbrace{y_i}_\text{actual} - \underbrace{w_0 + w_1 x_i}_\text{predicted} \right) = 0$$

Before I continue, I want to highlight that this itself is an important balance condition, much like those we discussed in Chapter 1.3. It's saying that the sum of the errors of the optimal line's predictions – that is, the line with intercept $w_0^*$ and slope $w_1^*$ – is 0.

Let’s continue with the first step – I’ll try and keep the commentary to a minimum. It’s important to try and replicate these steps yourself, on paper.

$$
\begin{aligned}
\sum_{i = 1}^n \left( y_i - (w_0 + w_1 x_i) \right) &= 0 \\
\sum_{i = 1}^n \left( y_i - w_0 - w_1 x_i \right) &= 0 \\
\sum_{i = 1}^n y_i - \sum_{i = 1}^n w_0 - \sum_{i = 1}^n w_1 x_i &= 0 \\
\sum_{i = 1}^n y_i - n w_0 - w_1 \sum_{i = 1}^n x_i &= 0 \\
\sum_{i = 1}^n y_i - w_1 \sum_{i = 1}^n x_i &= n w_0 \\
\frac{\sum_{i = 1}^n y_i}{n} - w_1 \frac{\sum_{i = 1}^n x_i}{n} &= w_0 \\
w_0^* &= \bar{y} - w_1^* \bar{x}
\end{aligned}
$$

Awesome! We’re halfway there. We have a formula for the optimal intercept, $w_0^*$, in terms of the optimal slope, $w_1^*$. Let's use $w_0^* = \bar{y} - w_1^* \bar{x}$ and see where it gets us in the second equation.

$$
\begin{aligned}
-\frac{2}{n} \sum_{i = 1}^n x_i \left( y_i - (w_0 + w_1 x_i) \right) &= 0 \\
\sum_{i = 1}^n x_i \left( y_i - (w_0 + w_1 x_i) \right) &= 0 \\
\sum_{i = 1}^n x_i \left( y_i - (\underbrace{\bar{y} - w_1^* \bar{x}}_{w_0^*} + w_1^* x_i) \right) &= 0 \\
\sum_{i = 1}^n x_i ( \underbrace{y_i - \bar{y} + w_1^* \bar{x} - w_1^* x_i}_\text{distribute negation} ) &= 0 \\
\sum_{i = 1}^n x_i \left( (y_i - \bar{y}) - w_1^* (x_i - \bar{x}) \right) &= 0 \\
\underbrace{\sum_{i = 1}^n x_i (y_i - \bar{y}) - w_1^* \sum_{i=1}^n x_i (x_i - \bar{x})}_\text{expand summation} &= 0 \\
\sum_{i = 1}^n x_i (y_i - \bar{y}) &= w_1^* \sum_{i=1}^n x_i (x_i - \bar{x}) \\
w_1^* &= \frac{\sum_{i = 1}^n x_i (y_i - \bar{y})}{\sum_{i=1}^n x_i (x_i - \bar{x})}
\end{aligned}
$$

Rewriting and Using the Formulas

We’re done! We have formulas for the optimal slope and intercept. But, before we celebrate, I'm going to try and rewrite $w_1^*$ in an equivalent, more symmetrical form that is easier to interpret.

Claim:

$$w_1^* = \underbrace{\frac{\sum_{i = 1}^n x_i (y_i - \bar{y})}{\sum_{i=1}^n x_i (x_i - \bar{x})}}_\text{formula we derived above} = \underbrace{\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}}_\text{nicer looking formula}$$

This is not the only other equivalent formula for the slope; for instance, $w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})y_i}{\sum_{i=1}^n (x_i - \bar{x})^2}$ too, and you can verify this using the same logic as in the proof above.

To summarize, the parameters that minimize mean squared error for the simple linear regression model, $h(x_i) = w_0 + w_1 x_i$, are:

$$\boxed{w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad w_0^* = \bar{y} - w_1^* \bar{x}}$$

This is an important result, and you should remember it. There are a lot of symbols above, but just note that given a dataset $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, you could apply the formulas above by hand to find the optimal parameters yourself.

What does this line look like on the commute times data?

Image produced in Jupyter

The line above goes by many names:

  • The simple linear regression line that minimizes mean squared error.

  • The simple linear regression line (if said without context).

  • The regression line.

  • The least squares regression line (because it has the least mean squared error).

  • The line of best fit.

Whatever you’d like to call it, now that we’ve found our optimal parameters, we can use them to make predictions.

$$h(x_i) = w_0^* + w_1^* x_i$$

On the dataset of commute times:

# Assume x is an array with departure hours and y is an array with commute times.
w1_star = np.sum((x - np.mean(x)) * (y - np.mean(y))) / np.sum((x - np.mean(x)) ** 2)
w0_star = np.mean(y) - w1_star * np.mean(x)

w0_star, w1_star
(142.4482415877287, -8.186941724265552)

So, our specific fit – or trained – hypothesis function is:

$$
\begin{align*}
\text{predicted commute time}_i &= h(\text{departure hour}_i) \\
&= 142.45 - 8.19 \cdot \text{departure hour}_i
\end{align*}
$$

This trained hypothesis function is not saying that leaving later causes you to have shorter commutes. Rather, that’s just the best linear pattern it observed in the data for the purposes of minimizing mean squared error. In reality, there are other factors that affect commute times, and we haven’t performed a thorough-enough analysis to say anything about the causal relationship between departure time and commute time.

To predict how long it might take to get to school tomorrow, plug in the time you'd like to leave for $\text{departure hour}_i$ and out will come your prediction. The slope, -8.19, is in units of $\frac{\text{units of } y}{\text{units of } x} = \frac{\text{minutes}}{\text{hour}}$, and is telling us that for every hour later you leave, your predicted commute time decreases by 8.19 minutes.

In Python, I can define a predict function as follows:

def predict(x_new):
    return w0_star + w1_star * x_new

# Predicted commute time if I leave at 8:30AM.
predict(8.5)
72.8592369314715
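
As a quick sanity check tied to the balance condition we derived earlier – that the optimal line's errors sum to 0 – we can add up the residuals of the fitted line, using the x, y, w0_star, and w1_star defined above:

residuals = y - (w0_star + w1_star * x)
np.sum(residuals)  # Should be 0, up to floating point rounding.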

Regression Line Passes Through the Mean

There’s an important property that the regression line satisfies: for any dataset, the line that minimizes mean squared error passes through the point $(\text{mean of } x, \text{mean of } y)$.

Image produced in Jupyter
predict(np.mean(x))
73.18461538461538
# Same!
np.mean(y)
73.18461538461538

Our commute times regression line passes through the point $(\bar{x}, \bar{y})$, even though that point was not necessarily one of the original points in the dataset.

Intuitively, this says that for an average input, the line that minimizes mean squared error will always predict an average output.

Why is this fact true? See if you can reason about it yourself, then check the solution once you’ve attempted it.

The Modeling Recipe

To conclude, let’s run through the three-step modeling recipe.

1. Choose a model.

$$h(x_i) = w_0 + w_1 x_i$$

2. Choose a loss function.

We chose squared loss:

$$L_\text{sq}(y_i, h(x_i)) = (y_i - h(x_i))^2$$

3. Minimize average loss to find optimal parameters.

For the simple linear regression model, empirical risk is:

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$$

We showed that:

$$w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad w_0^* = \bar{y} - w_1^* \bar{x}$$

While the process of minimizing $R_\text{sq}$ was much, much more complex than in the case of our single parameter model, the conceptual backing of the process was still this three-step recipe, and hopefully now you see its value.


Correlation

Sometimes, we’re not necessarily interested in making predictions, but instead want to be descriptive about patterns that exist in data.

In a scatter plot of two variables, if there is any pattern, we say the variables are associated. If the pattern resembles a straight line, we say the variables are correlated, i.e. linearly associated. We can measure how much a scatter plot resembles a straight line using the correlation coefficient.

Interpreting the Correlation Coefficient

There are actually many different correlation coefficients; the one we'll use, defined below, is the most common, and it's sometimes called the Pearson correlation coefficient, after the British statistician Karl Pearson.

No matter the values of $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$, the value of $r$ is bounded between -1 and 1. The closer $|r|$ is to 1, the stronger the linear association. The sign of $r$ tells us the direction of the trend – upwards (positive) or downwards (negative). $r$ is a unitless quantity – it's not measured in hours, or dollars, or minutes, or anything else that depends on the units of $x$ and $y$.

Image produced in Jupyter

The plots above give us some examples of what the correlation coefficient can look like in practice.

  • Top left ($r = 0.046$): There's some loose circle-like pattern, but it mostly looks like a random cloud of points. $|r|$ is close to 0, but just happens to be positive.

  • Top right ($r = -0.993$): The points are very tightly clustered around a line with a negative slope, so $r$ is close to -1.

  • Bottom left ($r = -0.031$): While the points are certainly associated, they are not linearly associated, so the value of $r$ is close to 0. (The shape looks more like a V or parabola than a straight line.)

  • Bottom right ($r = 0.607$): The points are loosely clustered and follow a roughly linear pattern trending upwards. $r$ is positive, but not particularly large.

The correlation coefficient has some useful properties to be aware of. For one, it's symmetric: $r(x, y) = r(y, x)$. If you swap the $x_i$'s and $y_i$'s in its formula, you'll see the result is the same.

$$r = \frac{1}{n} \sum_{i = 1}^n \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right)$$

One way to think of $r$ is that it's the mean of the product of $x$ and $y$, once both variables have been standardized. To standardize a collection of numbers $x_1, x_2, \ldots, x_n$, you first find the mean $\bar{x}$ and standard deviation $\sigma_x$ of the collection. Then, for each $x_i$, you compute:

$$z_i = \frac{x_i - \bar{x}}{\sigma_x}$$

This tells you how many standard deviations away from the mean each $x_i$ is. For example, if $z_i = -1.5$, that means $x_i$ is 1.5 standard deviations below the mean of $x$. The value of $x_i$ once it's standardized is sometimes called its $z$-score; you may have heard of $z$-scores in the context of curved exam scores.

With this in mind, I'll again state that $r$ is the mean of the product of $x$ and $y$, once both variables have been standardized:

$$r = \frac{1}{n} \sum_{i = 1}^n \underbrace{\left( \frac{x_i - \bar{x}}{\sigma_x} \right)}_{x_i\text{'s } z\text{-score}} \times \underbrace{\left( \frac{y_i - \bar{y}}{\sigma_y} \right)}_{y_i\text{'s } z\text{-score}}$$

This interpretation of $r$ makes it a bit easier to see why $r$ measures the strength of linear association – because up until now, it must seem like a formula I pulled out of thin air.

If there's positive linear association, then $x_i$ and $y_i$ will usually either both be above their averages, or both be below their averages, meaning that $x_i - \bar{x}$ and $y_i - \bar{y}$ will usually have the same sign. If we multiply two numbers with the same sign – either both positive or both negative – then the product will be positive.

Image produced in Jupyter

Since most points are in the bottom-left and top-right quadrants, most of the products $(x_i - \bar{x})(y_i - \bar{y})$ are positive. This means that $r$, which is the average of these products divided by the standard deviations of $x$ and $y$, will be positive too. (We divide by the standard deviations to ensure that $-1 \leq r \leq 1$.)

Above, $r$ is positive but not exactly 1, since there are several points in the bottom-right and top-left quadrants, which would have a negative product $(x_i - \bar{x})(y_i - \bar{y})$ and bring down the average product.

If there's negative linear association, then typically it'll be the case that $x_i$ is above average while $y_i$ is below average, or vice versa. This means that $x_i - \bar{x}$ and $y_i - \bar{y}$ will usually have opposite signs, and when they have opposite signs, their product will be negative. If most points have a negative product, then $r$ will be negative too.
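
Here's a short sketch of this $z$-score interpretation in code, assuming x and y are the commute-times arrays from earlier; np.corrcoef is used only as an independent check of our formula:

# Standardize both variables, then average the products of their z-scores.
z_x = (x - np.mean(x)) / np.std(x)
z_y = (y - np.mean(y)) / np.std(y)
r = np.mean(z_x * z_y)

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
r, np.corrcoef(x, y)[0, 1]  # The two values should match.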

Preserving Correlation

Since $r$ measures how closely points cluster around a line, it is invariant to changes in units of measurement – that is, to linear transformations applied to each variable independently (up to a possible flip in sign, as we'll see below).

Image produced in Jupyter

The top left scatter plot is the same as in the previous example, where we reasoned about why $r$ is positive. The other three plots result from applying linear transformations to the $x$ and/or $y$ variables independently. A linear transformation of $x$ is any function of the form $ax + b$, and a linear transformation of $y$ is any function of the form $cy + d$. (This is an idea we'll revisit more in Chapter 2.)

Notice that three of the four plots have the same $r$ of approximately 0.79. The bottom right plot has an $r$ of approximately -0.79, because the $y$ coordinates were multiplied by a negative constant. What we're seeing is that the correlation coefficient is invariant to linear transformations of the two variables independently, with only its sign flipping when one of the multiplying constants is negative.

Put in real-world terms: it doesn't matter if you measure commute times in hours, minutes, or seconds – the correlation between departure time and commute time will be the same in all three cases.
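
Here's a minimal sketch of this invariance on synthetic data (the particular constants and the random seed below are arbitrary):

rng = np.random.default_rng(42)
x_demo = rng.normal(size=100)
y_demo = 2 * x_demo + rng.normal(size=100)

r_original = np.corrcoef(x_demo, y_demo)[0, 1]
r_rescaled = np.corrcoef(60 * x_demo + 7, 0.5 * y_demo - 3)[0, 1]  # Positive scalings.
r_flipped = np.corrcoef(x_demo, -4 * y_demo)[0, 1]                 # Negative scaling of y.

# The first two should match; the third should have the opposite sign.
r_original, r_rescaled, r_flipped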

Correlation and the Regression Line

Since $r$ measures how closely points cluster around a line, it shouldn't be all that surprising that $r$ has something to do with $w_1^*$, the slope of the regression line.

It turns out that:

$$w_1^* = \underbrace{\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}}_\text{from earlier} = \boxed{r \frac{\sigma_y}{\sigma_x}}$$

This is my preferred version of the formula for the optimal slope – it's easy to use and interpret. I've hidden the proof behind a dropdown menu below, but you really should attempt it on your own (and then read it), since it helps build familiarity with how the various components of the formula for $r$ and $w_1^*$ are related.

The simpler formula above implies that the sign of the slope is the same as the sign of $r$, which seems reasonable: if the direction of the linear association is negative, the best-fitting slope should be, too.

So, all in one place:

$$w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = r \frac{\sigma_y}{\sigma_x}, \qquad w_0^* = \bar{y} - w_1^* \bar{x}$$
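
To tie this back to the commute times data, here's a one-line check – assuming x, y, and w1_star are still defined as in the earlier code – that the two formulas for the slope agree:

r = np.corrcoef(x, y)[0, 1]
r * np.std(y) / np.std(x), w1_star  # Both should be approximately -8.19.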

This new formula for the slope also gives us insight into how the spread of $x$ ($\sigma_x$) and $y$ ($\sigma_y$) affects the slope. If $y$ is more spread out than $x$, the points on the scatter plot will be stretched out vertically, which will make the best-fitting slope steeper.

Image produced in Jupyter

In the middle example above, $y_i \rightarrow 2y_i$ means that we replaced each $y_i$ in the dataset with $2y_i$. In that example, the slope and intercept of the regression line both doubled. In the third example, where we replaced each $x_i$ with $3x_i$, the slope was divided by 3, while the intercept remained the same. One of the problems in Homework 2 has you prove these sorts of results, and you can do so by relying on the formula for $w_1^*$ that involves $r$; note that all three datasets above have the same $r$.

Example: Anscombe’s Quartet

The correlation coefficient is just one number that describes the linear association between two variables; it doesn’t tell us everything.

Consider the famous example of Anscombe’s quartet, which consists of four datasets that all have the same mean, standard deviation, and correlation coefficient, but look very different.

Image produced in Jupyter

In all four datasets:

$$\bar{x} = 9, \quad \bar{y} = 7.5, \quad \sigma_x = 3.16, \quad \sigma_y = 1.94, \quad r = 0.82$$

Because they all share the same values of these five key quantities, they also share the same regression line, since the optimal slope and intercept are determined using just those five quantities.

$$w_1^* = r \frac{\sigma_y}{\sigma_x} = 0.82 \cdot \frac{1.94}{3.16} \approx 0.50 \qquad w_0^* = \bar{y} - w_1^* \bar{x} = 7.5 - 0.50 \cdot 9 = 3.0$$
Image produced in Jupyter

The regression line clearly looks better for some datasets than others, with Dataset IV looking particularly off. A high $|r|$ may be evidence of a strong linear association, but it cannot guarantee that a linear model is suitable for a dataset. Moral of the story – visualize your data before trying to fit a model! Don't just trust the numbers.
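
If you'd like to verify the shared regression line yourself, here's a sketch using the copy of Anscombe's quartet that ships with the seaborn library (this assumes seaborn is installed and can fetch its example datasets):

import numpy as np
import seaborn as sns

anscombe = sns.load_dataset("anscombe")  # Columns: dataset, x, y.
for name, group in anscombe.groupby("dataset"):
    x_a, y_a = group["x"].to_numpy(), group["y"].to_numpy()
    w1 = np.sum((x_a - x_a.mean()) * (y_a - y_a.mean())) / np.sum((x_a - x_a.mean()) ** 2)
    w0 = y_a.mean() - w1 * x_a.mean()
    print(name, round(w0, 2), round(w1, 2))
# Each of the four datasets should print (approximately) the same intercept and slope.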

You might like the Datasaurus Dozen, another similar collection of 13 datasets that all have the same mean, standard deviation, and correlation coefficient, but look very different. (One looks like a dinosaur!)