Sums and averages play an important role in machine learning. In Chapter 1.2, we’ll learn to take the average of an important measurement (called a “loss function”) for every value in our dataset.
Here, we’ll review the most relevant properties of summation notation, and use the arithmetic mean as a case study of sorts.
For example, if we take $x_i = i^2$, then $\sum_{i=1}^{6} i^2$ represents the sum of the squares of all integers from 1 to 6:
$$\sum_{i=1}^{6} i^2 = 1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2$$
Notice that both the starting and ending indices (1 and 6, respectively) are included in the sum. Summation notation allows us to express sums conveniently – the left-hand side above is more compact than the right-hand side.
Often, we’ll take the sum of the first n terms of a sequence. For example, the sum of the squares of the first n positive integers is:
$$\sum_{i=1}^{n} i^2$$
Note that the index of summation can be any variable name ($i$ is just a typical choice). That is, $\sum_{j=1}^{n} j^2$, $\sum_{i=1}^{n} i^2$, and $\sum_{\text{zebra}=1}^{n} \text{zebra}^2$ all represent the same sum.
Summation notation can be thought of in terms of a for-loop. In Python, to compute the sum $\sum_{i=a}^{b} i^2$, we could write:
total = 0
for i in range(a, b + 1):
    total = total + i ** 2
As we mentioned above, the ending index is inclusive in summation notation. This is in contrast to Python, where the ending index is exclusive, which is why we provided b + 1 as the second argument to the range function instead of b.
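To make the inclusive/exclusive distinction concrete, here is a minimal sketch that wraps the loop in a function and compares it with Python's built-in `sum`. The function name `sum_of_squares` is my own illustrative choice, not something defined elsewhere in these notes.

```python
def sum_of_squares(a, b):
    """Return the sum of i**2 for i = a, a+1, ..., b (both ends included)."""
    total = 0
    # range excludes its second argument, so b + 1 is needed to include b.
    for i in range(a, b + 1):
        total = total + i ** 2
    return total


# The same sum, written with the built-in sum and a generator expression.
compact = sum(i ** 2 for i in range(1, 6 + 1))

print(sum_of_squares(1, 6))  # 1 + 4 + 9 + 16 + 25 + 36 = 91
print(compact)               # 91
```

Forgetting the `+ 1` would silently drop the last term, which is an easy mistake to make when translating a summation into code.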
To illustrate various properties of summation notation, we’ll use the following fact:
$$\sum_{i=0}^{n} 2^i = 2^0 + 2^1 + 2^2 + \ldots + 2^n = 2^{n+1} - 1$$
The expression $2^{n+1} - 1$ is called a closed form for the summation. When closed forms exist, they make it easy to compute the value of a summation. You’ll notice that in addition to writing the sum using summation notation, I showed the first few and last terms of the sum, with a ... to indicate that the pattern continues in between. As convenient as summation notation is, explicitly writing the first few and last terms of a sum can sometimes make it easier to understand what exactly is being summed.
For example, $\sum_{i=0}^{3} 2^i = 2^0 + 2^1 + 2^2 + 2^3 = 1 + 2 + 4 + 8 = 15$, which is indeed $2^{3+1} - 1 = 15$. The sequence $2^0, 2^1, 2^2, \ldots, 2^n$ is called a geometric sequence, and the resulting sum is called a geometric series.
For convenience, we’ll define $S(n) = \sum_{i=0}^{n} 2^i$; again, $S(n) = 2^{n+1} - 1$.
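If you'd like to convince yourself that the closed form holds, one option is to compare it against a term-by-term sum for several values of $n$. The sketch below does this; the names `S` and `S_closed` are illustrative, and checking $n$ from 0 to 20 is evidence, not a proof.

```python
def S(n):
    """Compute 2**0 + 2**1 + ... + 2**n term by term."""
    return sum(2 ** i for i in range(0, n + 1))


def S_closed(n):
    """Closed form from the text: 2**(n + 1) - 1."""
    return 2 ** (n + 1) - 1


# Compare the two for n = 0, 1, ..., 20.
for n in range(21):
    assert S(n) == S_closed(n)

print(S(3), S_closed(3))  # 15 15, matching the worked example
```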
The best way to learn from these examples is to try to solve them yourself before looking at the solution.
Determine the value of $\sum_{i=8}^{19} 2^i$. (Find an answer that doesn’t involve summation notation or a sum over many terms.)
Solution
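We can’t use the formula for $S(n)$ directly, because the starting index is 8, not 0. But, if we look at the expansion of $S(19)$, we’ll see that it contains the sum we’re looking for:
$$S(19) = \underbrace{2^0 + 2^1 + \ldots + 2^7}_{S(7)} + \underbrace{2^8 + 2^9 + \ldots + 2^{19}}_{\text{the sum we want}}$$
So, $\sum_{i=8}^{19} 2^i = S(19) - S(7) = (2^{20} - 1) - (2^8 - 1) = 2^{20} - 2^8 = 1048576 - 256 = 1048320$.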
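A quick numerical check can confirm this kind of reasoning. The sketch below compares a direct term-by-term sum against $S(19) - S(7)$, that is, $S(19)$ with the terms $2^0$ through $2^7$ removed. The variable names are illustrative.

```python
def S(n):
    """Closed form for 2**0 + 2**1 + ... + 2**n."""
    return 2 ** (n + 1) - 1


# Direct sum of 2**i for i = 8, 9, ..., 19 (range needs 19 + 1 to include 19).
direct = sum(2 ** i for i in range(8, 19 + 1))

# Remove the unwanted terms 2**0, ..., 2**7 from S(19).
via_closed_form = S(19) - S(7)

print(direct)           # 1048320
print(via_closed_form)  # 1048320
```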
Next, consider the sum $\sum_{i=10}^{20} 2^{i-5}$. As we did in the very first example, let’s try and write the sum as a difference of two calls to $S(n)$. $S(n)$ involves the sum of terms of the form $2^i$, and the sum we’re looking for involves the terms $2^{i-5}$, so we’ll need to shift the indices.
When $i = 10$, $i - 5 = 5$, and when $i = 20$, $i - 5 = 15$. So, we can rewrite the sum in question as:
$$\sum_{i=10}^{20} 2^{i-5} = \sum_{i=5}^{15} 2^i$$
This looks like $S(15) - S(4)$, or $2^{16} - 1 - (2^5 - 1) = 2^{16} - 2^5 = 65536 - 32 = 65504$.
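Index shifts are easy to get wrong by one, so it can help to check them numerically. This sketch evaluates the original sum, the shifted sum, and the closed-form difference, and they should all agree. The variable names are my own.

```python
def S(n):
    """Closed form for 2**0 + 2**1 + ... + 2**n."""
    return 2 ** (n + 1) - 1


# Original sum: exponents i - 5 as i runs from 10 to 20.
original = sum(2 ** (i - 5) for i in range(10, 20 + 1))

# Shifted sum: exponents i as i runs from 5 to 15.
shifted = sum(2 ** i for i in range(5, 15 + 1))

# Difference of closed forms: drop the terms 2**0, ..., 2**4 from S(15).
closed = S(15) - S(4)

print(original, shifted, closed)  # 65504 65504 65504
```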
As I mentioned at the start of this section, we’ll work with sums of data points quite frequently in this class. We’ll often set up a problem by saying we have a sequence of $n$ scalar[1] values, represented by $x_1, x_2, \ldots, x_n$. For instance, perhaps there are $n$ students in this course, and $x_i$ represents the height of student $i$.
The mean, or average, of all $n$ values is given the symbol $\bar{x}$ (pronounced “x-bar”) and is defined as follows:
$$\bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
You’ve likely seen this definition before. But, an often-forgotten property of the mean is that the sum of the deviations from the mean is zero. By that, I mean (no pun intended) that if you:
compute the mean of a sequence of numbers,
compute the signed difference between each number and the mean, and then
sum all of those differences, the result will be zero.
Let’s first see this in action, then show why it is true in general. Suppose there are only 4 students in the class, with heights 72, 63, 68, and 65 inches. The mean of these heights is:
$$\bar{x} = \frac{72 + 63 + 68 + 65}{4} = 67$$
The deviations from the mean are:
$$\begin{aligned} 72 - 67 &= 5 \\ 63 - 67 &= -4 \\ 68 - 67 &= 1 \\ 65 - 67 &= -2 \end{aligned}$$
The sum of the four deviations, then, is:
$$5 + (-4) + 1 + (-2) = 0$$
So, the sum of the deviations from the mean, and therefore the mean deviation from the mean, is zero in this example.
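The same three steps (compute the mean, compute each signed difference, sum the differences) can be sketched in Python using the four heights from this example. The variable names are illustrative.

```python
heights = [72, 63, 68, 65]

n = len(heights)
x_bar = sum(heights) / n  # mean: 268 / 4

# Signed difference between each value and the mean.
deviations = [x - x_bar for x in heights]

print(x_bar)            # 67.0
print(deviations)       # [5.0, -4.0, 1.0, -2.0]
print(sum(deviations))  # 0.0
```

With other datasets, floating-point rounding can leave a tiny nonzero value in place of an exact 0, so comparing with a tolerance (for example, `math.isclose` with an `abs_tol`) is safer than testing for equality.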
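This is also true in general. Precisely, I’m claiming that if $x_1, x_2, \ldots, x_{n-1}, x_n$ are any $n$ numbers, and $\bar{x}$ is their mean, then $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$.
To see why, split the sum into two pieces and use the fact that $\sum_{i=1}^{n} x_i = n\bar{x}$, which follows directly from the definition of the mean:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = n\bar{x} - n\bar{x} = 0$$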
So, we’ve shown that the sum of the deviations from the mean is 0 in general. A consequence of this is that the total of the positive deviations and the total of the negative deviations are equal in magnitude, since they need to cancel each other out. In the 72, 63, 68, 65 example, the positive deviations are 5 and 1 and the negative deviations are -4 and -2, and both totals have magnitude 6. As a result, the mean is sometimes thought of as the “balance point” of the dataset – the point at which the negative deviations are balanced by the positive deviations. More on this in Chapter 1.3.
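To see the "balance point" idea numerically, the sketch below totals the positive and negative deviations separately for the four heights. The names `positive_total` and `negative_total` are my own.

```python
heights = [72, 63, 68, 65]
x_bar = sum(heights) / len(heights)
deviations = [x - x_bar for x in heights]

# Total of the deviations above the mean and below the mean, separately.
positive_total = sum(d for d in deviations if d > 0)  # 5 + 1
negative_total = sum(d for d in deviations if d < 0)  # -4 + -2

print(positive_total)  # 6.0
print(negative_total)  # -6.0
```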
Since the sum (and average) of deviations from the mean is 0, no matter the dataset, we can’t use the average deviation from the mean to measure how far values tend to be from the mean. The average deviation will be 0, whether the dataset is tightly clustered or spread out.
To measure how far values in a dataset tend to deviate from their mean, then, we’ll need to address the fact that some deviations are positive and some are negative. A common approach is to:
Compute the mean of the dataset
Compute the deviation of each value from the mean
Square each deviation
Take the average of the squared deviations
The result of this process is called the variance, denoted $s^2$ or $\sigma^2$; its square root is called the standard deviation.
Following the above steps, the variance is given by:
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
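Here is a minimal sketch of the four steps in Python. To avoid giving away the activity below, it uses a small made-up dataset (`values` is hypothetical, not from the course), and the function names are illustrative.

```python
import math


def variance(xs):
    """Variance as defined above: the average of the squared deviations."""
    n = len(xs)
    x_bar = sum(xs) / n                     # step 1: mean
    deviations = [x - x_bar for x in xs]    # step 2: deviations from the mean
    squared = [d ** 2 for d in deviations]  # step 3: square each deviation
    return sum(squared) / n                 # step 4: average the squares


def standard_deviation(xs):
    """Square root of the variance."""
    return math.sqrt(variance(xs))


values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data; the mean is 5

print(variance(values))            # 4.0
print(standard_deviation(values))  # 2.0
```

Note that this divides by $n$, matching the formula above. Python's `statistics.pvariance` uses the same convention, while `statistics.variance` divides by $n - 1$ instead.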
(If you’ve read Chapter 1.2, you’ll notice some similarities between this formula and the formula for mean squared error.)
Activity 3
What is the variance of the dataset 72, 63, 68, 65?
As a final note, the variance has a convenient, equivalent form:
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2$$
In English, we might say the variance is “the mean of the squares of x minus the square of the mean of x”.
Your turn: prove this equivalent form of the variance.
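Before attempting the proof, it can be reassuring to check that the two forms agree on some numbers. The sketch below does so on a made-up dataset (`values` is hypothetical, and the function names are my own). A numerical check like this is not a proof, but it would catch a mistake in the formula.

```python
import math


def variance_definition(xs):
    """Average of the squared deviations from the mean."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / n


def variance_shortcut(xs):
    """Mean of the squares minus the square of the mean."""
    n = len(xs)
    x_bar = sum(xs) / n
    mean_of_squares = sum(x ** 2 for x in xs) / n
    return mean_of_squares - x_bar ** 2


values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

# Compare with a tolerance, since the two forms round differently in general.
print(math.isclose(variance_definition(values), variance_shortcut(values)))  # True
```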