
Chapter 0.1. Summation Notation and the Mean

Sums and averages play an important role in machine learning. In Chapter 1.2, we’ll learn to take the average of an important measurement (called a “loss function”) for every value in our dataset.

Here, we’ll review the most relevant properties of summation notation, and use the arithmetic mean as a case study of sorts.

Introduction

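Summation notation gives us a compact way to write the sum of consecutive terms of a sequence: $\displaystyle \sum_{i = a}^b x_i$ means $x_a + x_{a+1} + \ldots + x_b$. For example, if we take $x_i = i^2$, then $\displaystyle \sum_{i = 1}^6 i^2$ represents the sum of the squares of all integers from 1 to 6:

$$\sum_{i = 1}^6 i^2 = 1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2$$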

Notice that both the starting and ending indices (1 and 6, respectively) are included in the sum. Summation notation allows us to express sums conveniently – the left-hand side above is more compact than the right-hand side.

Often, we’ll take the sum of the first $n$ terms of a sequence. For example, the sum of the squares of the first $n$ positive integers is:

$$\sum_{i = 1}^n i^2$$

Note that the index of summation can be any variable name ($i$ is just a typical choice). That is, $\displaystyle \sum_{j = 1}^n j^2$, $\displaystyle \sum_{i = 1}^n i^2$, and $\displaystyle \sum_{\text{zebra} = 1}^n \text{zebra}^2$ all represent the same sum.

Summation notation can be thought of in terms of a for-loop. In Python, to compute the sum $\displaystyle \sum_{i = a}^b i^2$, we could write:

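a, b = 1, 6  # example bounds; any integers with a <= b work
total = 0
for i in range(a, b + 1):  # range excludes its stop value, hence b + 1
    total = total + i ** 2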

As we mentioned above, the ending index is inclusive in summation notation. This is in contrast to Python, where the ending index is exclusive, which is why we provided b + 1 as the second argument to the range function instead of b.
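The same loop works for any summand. As a minimal sketch, the helper `summation` below (a hypothetical name chosen for illustration, not a built-in) takes a function that computes the term $x_i$, along with the inclusive bounds $a$ and $b$:

```python
def summation(term, a, b):
    """Return term(a) + term(a + 1) + ... + term(b), with both bounds inclusive."""
    total = 0
    for i in range(a, b + 1):  # b + 1 because range excludes its stop value
        total += term(i)
    return total


# Sum of the squares of the integers from 1 to 6.
squares = summation(lambda i: i ** 2, 1, 6)
print(squares)  # 91

# The built-in sum with a generator expression computes the same thing.
assert squares == sum(i ** 2 for i in range(1, 7))
```

If `a` is greater than `b`, the loop body never runs and the helper returns 0, which matches the usual convention that an empty sum is 0.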


Properties and Examples

To illustrate various properties of summation notation, we’ll use the following fact:

$$\sum_{i = 0}^n 2^i = 2^0 + 2^1 + 2^2 + \ldots + 2^n = 2^{n+1} - 1$$

The expression $2^{n+1} - 1$ is called a closed form for the summation. When closed forms exist, they make it easy to compute the value of a summation. You’ll notice that in addition to writing the sum using summation notation, I also showed the first few and last terms of the sum, with a ... to indicate that the pattern continues in between. As convenient as summation notation is, explicitly writing the first few and last terms of a sum can sometimes make it easier to understand what exactly is being summed.

For example, $\displaystyle \sum_{i = 0}^3 2^i = 2^0 + 2^1 + 2^2 + 2^3 = 1 + 2 + 4 + 8 = 15$, which is indeed $2^{3+1} - 1 = 15$. The sequence $2^0, 2^1, 2^2, \ldots, 2^n$ is called a geometric sequence, and the resulting sum is called a geometric series.

For convenience, we’ll define $\displaystyle S(n) = \sum_{i = 0}^n 2^i$; again, $S(n) = 2^{n+1} - 1$.
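A quick numerical check can build confidence in the closed form before we rely on it. The sketch below compares a brute-force sum against $2^{n+1} - 1$ for small $n$; the function name `S` mirrors the notation above, and the range of $n$ tested is an arbitrary choice. Agreement on a handful of values is evidence, not a proof.

```python
def S(n):
    """Closed form for 2^0 + 2^1 + ... + 2^n."""
    return 2 ** (n + 1) - 1


for n in range(0, 11):
    # Add the terms one at a time and compare against the closed form.
    brute_force = sum(2 ** i for i in range(0, n + 1))
    assert brute_force == S(n)

print(S(3))  # 15
```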

The best way to learn from these examples is to try to solve them yourself before looking at the solution.

Example: Partial Sums

Determine the value of $\displaystyle \sum_{i = 8}^{19} 2^i$. (Find an answer that doesn’t involve summation notation or a sum over many terms.)

Example: Constant Multiples

Determine the value of $\displaystyle \sum_{i = 0}^{7} 5 \cdot 2^i$.

Example: Sum of a Constant

Determine the value of $\displaystyle \sum_{i = 7}^{20} 5$.

Example: Shifting Indices

Determine the value of $\displaystyle \sum_{i = 10}^{20} 2^{i - 5}$.

Example: Separating Sums

Given that $\displaystyle \sum_{i = 1}^{n} i = \frac{n(n+1)}{2}$, determine the value of $\displaystyle \sum_{i = 0}^{12} (2^i + 5i)$.
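Once you have attempted the examples, one way to check your answers is to compare a brute-force sum against an expression built from $S(n)$. The sketch below does this for all five examples. The helper `S` and the variable names are illustrative, and the approach in each comment is one possible route to the answer, not necessarily the only one.

```python
def S(n):
    """Closed form for the sum of 2^i over i = 0, 1, ..., n."""
    return 2 ** (n + 1) - 1


# Partial sums: terms 8 through 19 are S(19) with terms 0 through 7 removed.
partial = S(19) - S(7)
assert partial == sum(2 ** i for i in range(8, 20))

# Constant multiples: the factor 5 can be pulled out of the sum.
multiple = 5 * S(7)
assert multiple == sum(5 * 2 ** i for i in range(0, 8))

# Sum of a constant: there are 20 - 7 + 1 = 14 terms, each equal to 5.
constant = 5 * (20 - 7 + 1)
assert constant == sum(5 for i in range(7, 21))

# Shifting indices: substituting j = i - 5 gives the sum of 2^j for j = 5..15.
shifted = S(15) - S(4)
assert shifted == sum(2 ** (i - 5) for i in range(10, 21))

# Separating sums: split into S(12) plus 5 times the sum of i for i = 0..12
# (the i = 0 term contributes nothing, so the formula for 1..12 applies).
separated = S(12) + 5 * (12 * 13 // 2)
assert separated == sum(2 ** i + 5 * i for i in range(0, 13))

print(partial, multiple, constant, shifted, separated)
```

Running the block prints the five values, so you can compare them against your own answers.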

Key Takeaways

In the order they were introduced in the examples, here are some useful properties of summations:

  1. Partial Sums:

    $$\sum_{i = 1}^{n} x_i = \sum_{i = 1}^{k} x_i + \sum_{i = k+1}^{n} x_i$$
  2. Constant Multiples:

    $$\sum_{i = 1}^{n} c x_i = c \sum_{i = 1}^{n} x_i$$
  3. Sum of a Constant:

    $$\sum_{i = 1}^{n} c = c \cdot n$$
  4. Shifting Indices:

    $$\sum_{i = k}^{n} x_i = \sum_{i = 0}^{n-k} x_{i+k}$$
  5. Separating Sums:

    $$\sum_{i = 1}^{n} (x_i + y_i) = \sum_{i = 1}^{n} x_i + \sum_{i = 1}^{n} y_i$$



Mean and Standard Deviation

As I mentioned at the start of this section, we’ll work with sums of data points quite frequently in this class. We’ll often set up a problem by saying we have a sequence of $n$ scalar[1] values, represented by $x_1, x_2, \ldots, x_n$. For instance, perhaps there are $n$ students in this course, and $x_i$ represents the height of student $i$.

Mean

The mean, or average, of all $n$ values is given the symbol $\bar{x}$ (pronounced “x-bar”) and is defined as follows:

$$\bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n} = \frac{1}{n} \sum_{i = 1}^n x_i$$

You’ve likely seen this definition before. But, an often-forgotten property of the mean is that the sum of the deviations from the mean is zero. By that, I mean (no pun intended) that if you:

  1. compute the mean of a sequence of numbers,
  2. compute the signed difference between each number and the mean, and then
  3. sum all of those differences,

the result will be zero.

Let’s first see this in action, then show why it is true in general. Suppose there are only 4 students in the class, with heights 72, 63, 68, and 65 inches. The mean of these heights is:

$$\bar{x} = \frac{72 + 63 + 68 + 65}{4} = 67$$

The deviations from the mean are:

\begin{align*}
72 - 67 &= 5 \\
63 - 67 &= -4 \\
68 - 67 &= 1 \\
65 - 67 &= -2
\end{align*}

The sum of the four deviations, then, is:

$$5 + (-4) + 1 + (-2) = 0$$

So, the sum of the deviations from the mean is zero in this example.

This is also true in general. Precisely, I’m claiming that if $x_1, x_2, \ldots, x_{n-1}, x_n$ are any $n$ numbers, and $\bar{x}$ is their mean, then $\displaystyle \sum_{i = 1}^n (x_i - \bar{x}) = 0$.

Let’s prove it:

\begin{align*}
\sum_{i = 1}^n (x_i - \bar{x}) &= \sum_{i = 1}^n x_i - \sum_{i = 1}^n \bar{x} \\
&= \sum_{i = 1}^n x_i - n \bar{x} \\
&= \sum_{i = 1}^n x_i - n \left( \frac{x_1 + x_2 + \ldots + x_n}{n} \right) \\
&= \sum_{i = 1}^n x_i - (x_1 + x_2 + \ldots + x_n) \\
&= 0
\end{align*}

So, we’ve shown that the sum of the deviations from the mean is 0 in general. A consequence of this is that the total of the positive deviations and the total of the negative deviations are equal in magnitude, since they need to cancel each other out. In the 72, 63, 68, 65 example, the positive deviations are 5 and 1 and the negative deviations are -4 and -2, and each group has a total magnitude of 6. As a result, the mean is sometimes thought of as the “balance point” of the dataset: the point at which the negative deviations are balanced by the positive deviations. More on this in Chapter 1.3.
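The balancing of positive and negative deviations can be seen directly in code. This is a minimal sketch using the four heights from the example; the variable names are illustrative.

```python
heights = [72, 63, 68, 65]  # the four heights from the example, in inches

mean = sum(heights) / len(heights)
deviations = [x - mean for x in heights]  # signed difference from the mean

# Total the positive and negative deviations separately.
positive_total = sum(d for d in deviations if d > 0)
negative_total = sum(d for d in deviations if d < 0)

print(mean)             # 67.0
print(deviations)       # [5.0, -4.0, 1.0, -2.0]
print(sum(deviations))  # 0.0
print(positive_total, negative_total)  # 6.0 -6.0
```

For data whose mean cannot be represented exactly in floating point, the computed sum of deviations may come out as a tiny nonzero number rather than exactly 0. That is a rounding effect, not a counterexample to the proof.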

Standard Deviation

Since the sum (and average) of deviations from the mean is 0, no matter the dataset, we can’t use the average deviation from the mean to measure how far values tend to be from the mean. The average deviation will be 0, whether the dataset is tightly clustered or spread out.

To measure how far values in a dataset tend to deviate from their mean, then, we’ll need to address the fact that some deviations are positive and some are negative. A common approach is to:

  1. Compute the mean of the dataset
  2. Compute the deviation of each value from the mean
  3. Square each deviation
  4. Take the average of the squared deviations

The result of this process is called the variance, denoted $s^2$ or $\sigma^2$; its square root is called the standard deviation.

Following the above steps, the variance is given by:

$$\sigma^2 = \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^2$$
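The four steps translate almost line for line into Python. The sketch below reuses the heights from the Mean section as sample data and, as a sanity check, compares the result with `statistics.pvariance` and `statistics.pstdev` from the standard library, which also divide by $n$. The variable names are illustrative.

```python
import math
import statistics

heights = [72, 63, 68, 65]  # example data from the Mean section, in inches
n = len(heights)

mean = sum(heights) / n                            # step 1: the mean
deviations = [x - mean for x in heights]           # step 2: deviations from the mean
squared_deviations = [d ** 2 for d in deviations]  # step 3: square each deviation
variance = sum(squared_deviations) / n             # step 4: average the squares
std_dev = math.sqrt(variance)                      # standard deviation

print(variance)  # 11.5
print(std_dev)

# The standard library's population versions should agree.
assert math.isclose(variance, statistics.pvariance(heights))
assert math.isclose(std_dev, statistics.pstdev(heights))
```

Note that `statistics.variance` and `statistics.stdev` (without the `p` prefix) divide by $n - 1$ instead, so they would not match the formula above.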

(If you’ve read Chapter 1.2, you’ll notice some similarities between this formula and the formula for mean squared error.)

As a final note, the variance has a convenient, equivalent form:

$$\sigma^2 = \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^2 = \frac{1}{n} \sum_{i = 1}^n x_i^2 - \bar{x}^2$$

In English, we might say the variance is “the mean of the squares of $x$ minus the square of the mean of $x$”.
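Before attempting a proof, it can help to see the two forms agree on concrete numbers. The sketch below uses `fractions.Fraction` so the comparison is exact rather than subject to floating-point rounding; the data values are arbitrary and chosen only for illustration. A numerical check like this is not a substitute for the proof.

```python
from fractions import Fraction

values = [Fraction(v) for v in (3, 8, 1, 12, 7)]  # arbitrary illustrative data
n = len(values)
mean = sum(values) / n

# Definition: the mean of the squared deviations from the mean.
definition = sum((x - mean) ** 2 for x in values) / n

# Equivalent form: the mean of the squares minus the square of the mean.
shortcut = sum(x ** 2 for x in values) / n - mean ** 2

assert definition == shortcut
print(definition)
```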

Your turn: prove this equivalent form of the variance.

Footnotes
  1. “Scalar” just means “individual number”, as opposed to a vector or matrix which can contain multiple numbers, as we’ll see later in the course.