
2.4. Correlation

Sometimes, we’re not necessarily interested in making predictions; instead, we want to describe the patterns that exist in data.

In a scatter plot of two variables, if there is any pattern, we say the variables are associated. If the pattern resembles a straight line, we say the variables are correlated, i.e. linearly associated. We can measure how much a scatter plot resembles a straight line using the correlation coefficient.

Interpreting the Correlation Coefficient

There are actually many different correlation coefficients; the one we’ll use is the most common, and it’s sometimes called Pearson’s correlation coefficient, after the British statistician Karl Pearson.

No matter the values of $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$, the value of $r$ is bounded between -1 and 1. The closer $|r|$ is to 1, the stronger the linear association. The sign of $r$ tells us the direction of the trend: upwards (positive) or downwards (negative). $r$ is a unitless quantity: it’s not measured in hours, or dollars, or minutes, or anything else that depends on the units of $x$ and $y$.

Now that we have the optimal line, we can use correlation to measure the strength of linear relationships.

Image produced in Jupyter

The plots above give us some examples of what the correlation coefficient can look like in practice.

  • Top left ($r = 0.046$): There’s some loose circle-like pattern, but it mostly looks like a random cloud of points. $|r|$ is close to 0, but just happens to be positive.

  • Top right ($r = -0.993$): The points are very tightly clustered around a line with a negative slope, so $r$ is close to -1.

  • Bottom left ($r = -0.031$): While the points are certainly associated, they are not linearly associated, so the value of $r$ is close to 0. (The shape looks more like a V or parabola than a straight line.)

  • Bottom right ($r = 0.607$): The points are loosely clustered and follow a roughly linear pattern trending upwards. $r$ is positive, but not particularly large.

The correlation coefficient has some useful properties to be aware of. For one, it’s symmetric: $r(x, y) = r(y, x)$. If you swap the $x_i$’s and $y_i$’s in its formula, you’ll see the result is the same.

$$r = \frac{1}{n} \sum_{i = 1}^n \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right)$$
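
To make the formula concrete, here’s a minimal sketch in Python (the data is made up purely for illustration) that computes $r$ directly and checks it against NumPy’s built-in `np.corrcoef`:

```python
import numpy as np

# Hypothetical data, just for illustration.
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 4.0, 6.0, 9.0])

# np.std computes the population standard deviation by default
# (dividing by n), matching the 1/n in the formula above.
r = np.mean((x - x.mean()) / x.std() * (y - y.mean()) / y.std())

# Sanity check against NumPy's built-in Pearson correlation.
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```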

One way to think of $r$ is that it’s the mean of the product of $x$ and $y$, once both variables have been standardized. To standardize a collection of numbers $x_1, x_2, \ldots, x_n$, you first find the mean $\bar{x}$ and standard deviation $\sigma_x$ of the collection. Then, for each $x_i$, you compute:

$$z_i = \frac{x_i - \bar{x}}{\sigma_x}$$

This tells you how many standard deviations away from the mean each $x_i$ is. For example, if $z_i = -1.5$, that means $x_i$ is 1.5 standard deviations below the mean of $x$. The value of $x_i$ once it’s standardized is sometimes called its $z$-score; you may have heard of $z$-scores in the context of curved exam scores.
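
To make standardization concrete, here’s a tiny sketch with made-up numbers; standardized values always end up with mean 0 and standard deviation 1:

```python
import numpy as np

x = np.array([10.0, 14.0, 18.0, 26.0])   # made-up collection
z = (x - x.mean()) / x.std()             # each value's z-score

print(z)                  # how many SDs each x_i is from the mean
print(z.mean(), z.std())  # 0 and 1 (up to floating-point error)
```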

With this in mind, I’ll again state that $r$ is the mean of the product of $x$ and $y$, once both variables have been standardized:

$$r = {\color{orange} \frac{1}{n} \sum_{i = 1}^n} \underbrace{\left( {\color{#3d81f6} \frac{x_i - \bar{x}}{\sigma_x}} \right)}_{x_i\text{'s $z$-score}} {\color{#d81b60} \times} \underbrace{\left( {\color{#3d81f6} \frac{y_i - \bar{y}}{\sigma_y}} \right)}_{y_i\text{'s $z$-score}}$$

This interpretation of $r$ makes it a bit easier to see why $r$ measures the strength of linear association; up until now, it may have seemed like a formula I pulled out of thin air.

If there’s positive linear association, then $x_i$ and $y_i$ will usually either both be above their averages, or both be below their averages, meaning that $x_i - \bar{x}$ and $y_i - \bar{y}$ will usually have the same sign. If we multiply two numbers with the same sign, either both positive or both negative, then the product will be positive.

Image produced in Jupyter

Since most points are in the bottom-left and top-right quadrants, most of the products $(x_i - \bar{x})(y_i - \bar{y})$ are positive. This means that $r$, which is the average of these products divided by the standard deviations of $x$ and $y$, will be positive too. (We divide by the standard deviations to ensure that $-1 \leq r \leq 1$.)

Above, $r$ is positive but not exactly 1, since there are several points in the bottom-right and top-left quadrants, which have a negative product $(x_i - \bar{x})(y_i - \bar{y})$ and bring down the average product.

If there’s negative linear association, then typically it’ll be the case that $x_i$ is above average while $y_i$ is below average, or vice versa. This means that $x_i - \bar{x}$ and $y_i - \bar{y}$ will usually have opposite signs, and when they have opposite signs, their product will be negative. If most points have a negative product, then $r$ will be negative too.
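
The following sketch (with hypothetical, randomly generated data) checks this reasoning numerically: for an upward-trending cloud of points, most of the products $(x_i - \bar{x})(y_i - \bar{y})$ come out positive, and so does $r$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.5, size=500)  # positive linear trend

products = (x - x.mean()) * (y - y.mean())
print((products > 0).mean())    # well above 0.5: most products are positive
print(np.corrcoef(x, y)[0, 1])  # and r is positive, too
```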

Preserving Correlation

Since $r$ measures how closely points cluster around a line, it is invariant to the units of measurement and, more generally, to linear transformations applied to the variables independently.

Image produced in Jupyter

The top left scatter plot is the same as in the previous example, where we reasoned about why $r$ is positive. The other three plots result from applying linear transformations to the $x$ and/or $y$ variables independently. A linear transformation of $x$ is any function of the form $ax + b$, and a linear transformation of $y$ is any function of the form $cy + d$. (This is an idea we’ll revisit more in Chapter 6.1.)

Notice that three of the four plots have the same $r$ of approximately 0.79. The bottom right plot has an $r$ of approximately -0.79, because the $y$ coordinates were multiplied by a negative constant. What we’re seeing is that the correlation coefficient is invariant, up to sign, to linear transformations applied to the two variables independently.

Put in real-world terms: it doesn’t matter whether you measure commute times in hours, minutes, or seconds; the correlation between departure time and commute time will be the same in all three cases.
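
Here’s a quick numerical check of this invariance, with transformation constants chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

r = np.corrcoef(x, y)[0, 1]

# Linear transformations of each variable independently leave r unchanged...
assert np.isclose(r, np.corrcoef(60 * x, y)[0, 1])         # e.g., hours -> minutes
assert np.isclose(r, np.corrcoef(x + 5, 3 * y - 2)[0, 1])

# ...except that multiplying one variable by a negative constant flips the sign.
assert np.isclose(-r, np.corrcoef(x, -0.5 * y)[0, 1])
```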

Correlation and the Regression Line

Since $r$ measures how closely points cluster around a line, it shouldn’t be all that surprising that $r$ has something to do with $w_1^*$, the slope of the regression line.

It turns out that:

$$w_1^* = \underbrace{\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}}_{\text{from earlier}} = \boxed{r \frac{\sigma_y}{\sigma_x}}$$

This is my preferred version of the formula for the optimal slope: it’s easy to use and interpret. I’ve hidden the proof behind a dropdown menu below, but you really should attempt it on your own (and then read it), since it helps build familiarity with how the various components of the formulas for $r$ and $w_1^*$ are related.

The simpler formula above implies that the sign of the slope is the same as the sign of rr, which seems reasonable: if the direction of the linear association is negative, the best-fitting slope should be, too.

So, all in one place:

$$w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = r \frac{\sigma_y}{\sigma_x}, \qquad w_0^* = \bar{y} - w_1^* \bar{x}$$
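
As a sanity check, here’s a sketch (on made-up, randomly generated data) confirming that these formulas agree with the least-squares line that `np.polyfit` finds:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.5 * x + 4 + rng.normal(size=100)

r = np.corrcoef(x, y)[0, 1]
w1 = r * y.std() / x.std()     # optimal slope,     w1* = r * (sigma_y / sigma_x)
w0 = y.mean() - w1 * x.mean()  # optimal intercept, w0* = y-bar - w1* * x-bar

# np.polyfit with degree 1 fits the same least-squares line.
slope, intercept = np.polyfit(x, y, 1)
assert np.isclose(w1, slope) and np.isclose(w0, intercept)
```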

This new formula for the slope also gives us insight into how the spread of $x$ ($\sigma_x$) and of $y$ ($\sigma_y$) affects the slope. If $y$ is more spread out than $x$, the points on the scatter plot will be stretched out vertically, which will make the best-fitting slope steeper.

Image produced in Jupyter

In the middle example above, $y_i \rightarrow 2y_i$ means that we replaced each $y_i$ in the dataset with $2y_i$. In that example, the slope and intercept of the regression line both doubled. In the third example, where we replaced each $x_i$ with $3x_i$, the slope was divided by 3, while the intercept remained the same. One of the problems in Homework 2 has you prove these sorts of results, and you can do so by relying on the formula for $w_1^*$ that involves $r$; note that all three datasets above have the same $r$.
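
You can also verify these effects numerically; below is a sketch on made-up data, with the constants 2 and 3 matching the examples above:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 1.5 * x + 4 + rng.normal(size=100)

w1, w0 = np.polyfit(x, y, 1)

# Replacing each y_i with 2y_i doubles both the slope and the intercept.
assert np.allclose(np.polyfit(x, 2 * y, 1), [2 * w1, 2 * w0])

# Replacing each x_i with 3x_i divides the slope by 3; the intercept is unchanged.
assert np.allclose(np.polyfit(3 * x, y, 1), [w1 / 3, w0])

# r itself is unchanged in both cases, since the scaling constants are positive.
```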

Example: Anscombe’s Quartet

The correlation coefficient is just one number that describes the linear association between two variables; it doesn’t tell us everything.

Consider the famous example of Anscombe’s quartet, which consists of four datasets that all have the same means, standard deviations, and correlation coefficient, but look very different.

Image produced in Jupyter

In all four datasets:

$$\bar{x} = 9, \qquad \bar{y} = 7.5, \qquad \sigma_x = 3.16, \qquad \sigma_y = 1.94, \qquad r = 0.82$$

Because they all share the same values of these five key quantities, they also share the same regression line, since the optimal slope and intercept are determined using just those five quantities.

$$w_1^* = r \frac{\sigma_y}{\sigma_x} = 0.82 \cdot \frac{1.94}{3.16} \approx 0.50 \qquad w_0^* = \bar{y} - w_1^* \bar{x} = 7.5 - 0.50 \cdot 9 = 3.0$$
Image produced in Jupyter

The regression line clearly looks better for some datasets than others, with Dataset IV looking particularly off. A high $|r|$ may be evidence of a strong linear association, but it cannot guarantee that a linear model is suitable for a dataset. Moral of the story: visualize your data before trying to fit a model! Don’t just trust the numbers.
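
If you’d like to verify these numbers yourself, the quartet ships with seaborn (note that `load_dataset` fetches the data over the internet); a short sketch:

```python
import numpy as np
import seaborn as sns

anscombe = sns.load_dataset("anscombe")  # columns: dataset, x, y

for name, group in anscombe.groupby("dataset"):
    x = group["x"].to_numpy()
    y = group["y"].to_numpy()
    # Population standard deviations (dividing by n), as in the text.
    print(name, x.mean().round(2), y.mean().round(2),
          x.std().round(2), y.std().round(2),
          np.corrcoef(x, y)[0, 1].round(2))

# All four datasets print (nearly) identical summary statistics.
```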

You might like the Datasaurus Dozen, a similar collection of 13 datasets that all have the same means, standard deviations, and correlation coefficient, but look very different. (One looks like a dinosaur!)