
1.3. Absolute Loss

The Modeling Recipe

In Chapter 1.2, we implicitly introduced a three-step process for building a machine learning model.

Image produced in Jupyter

Most modern supervised learning algorithms follow these same three steps, just with different models, loss functions, and techniques for optimization.

Another name given to this process is empirical risk minimization.

When using squared loss, all three of these mean the same thing:

  • Average squared loss.

  • Mean squared error.

  • Empirical risk.

Risk is an idea from theoretical statistics that we’ll visit in a later chapter on probability. It refers to the expected error of a model, when considering the probability distribution of the data. “Empirical” risk refers to risk calculated using an actual, concrete dataset, rather than a theoretical distribution. The reason we call the average loss R is precisely because it is empirical risk.
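
To make this concrete, here is a minimal sketch in Python of the empirical risk under squared loss (that is, the mean squared error) of a constant prediction, using the small dataset of five values from Chapter 1.2. The function name and the use of NumPy are my own, not taken from the course code.

```python
import numpy as np

# The small dataset of five values from Chapter 1.2.
y = np.array([72, 90, 61, 85, 92])

# Illustrative helper (not from the course code).
def empirical_risk_squared(y, w):
    """Empirical risk under squared loss (mean squared error) of the constant prediction w."""
    return np.mean((y - w) ** 2)

print(empirical_risk_squared(y, w=75))        # 163.8
print(empirical_risk_squared(y, w=y.mean()))  # 138.8, since the mean (80) minimizes mean squared error
```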

The first half of the course – and in some ways, the entire course – is focused on empirical risk minimization, and so we will make many passes through the three-step modeling recipe ourselves, with differing models and loss functions.

A common question you’ll see in labs, homeworks, and exams will involve finding the optimal model parameters for a given model and loss function – in particular, for a combination of model and loss function that you’ve never seen before. For practice with this sort of exercise, work through the following activity. If you feel stuck, try reading through the rest of this section for context, then come back.


Absolute Loss

When we first introduced the idea of a loss function, we started by computing the error, e_i, of each prediction:

e_i = {\color{3D81F6}y_i} - {\color{orange}h(x_i)}

where {\color{3D81F6}y_i} is the actual value and {\color{orange}h(x_i)} is the predicted value.

The issue was that some errors were positive and some were negative, and so it was hard to compare them directly. We wanted the value of the loss function to be large for bad predictions and small for good predictions.

To get around this, we squared the errors, which gave us squared loss:

L_\text{sq}({\color{3D81F6}y_i}, {\color{orange}h(x_i)}) = ({\color{3D81F6}y_i} - {\color{orange}h(x_i)})^2

But, instead, we could have taken the absolute value of the errors. Doing so gives us absolute loss:

L_\text{abs}({\color{3D81F6}y_i}, {\color{orange}h(x_i)}) = |{\color{3D81F6}y_i} - {\color{orange}h(x_i)}|
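
In code, both loss functions are one-liners. Here’s a minimal sketch (the function names are my own, not from the course code):

```python
# Illustrative helpers (not from the course code).
def squared_loss(y_actual, y_pred):
    """Squared loss for a single prediction: (y_i - h(x_i))^2."""
    return (y_actual - y_pred) ** 2

def absolute_loss(y_actual, y_pred):
    """Absolute loss for a single prediction: |y_i - h(x_i)|."""
    return abs(y_actual - y_pred)

print(squared_loss(80, 90))   # 100
print(absolute_loss(80, 90))  # 10
```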

Below, I’ve visualized the absolute loss and squared loss for just a single data point.

Image produced in Jupyter

You should notice two key differences between the two loss functions:

  1. The absolute loss function is not differentiable when y_i = h(x_i). The absolute value function, f(x) = |x|, does not have a derivative at x = 0, because its slope to the left of x = 0 (-1) is different from its slope to the right of x = 0 (1). For more on this idea, see Appendix 2.

  2. The squared loss function grows much faster than the absolute loss function as the prediction h(x_i) gets further away from the actual value y_i. (See the example below.)
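
To see the second difference concretely, suppose, as a quick example, that y_i = 80. A prediction of h(x_i) = 90 has an absolute loss of 10 and a squared loss of 100. If the prediction worsens to h(x_i) = 110, the absolute loss only triples, to 30, while the squared loss jumps to 900.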

We know the optimal constant prediction, w^*, when using squared loss, is the mean. What is the optimal constant prediction when using absolute loss? The answer is no longer the mean; rather, it reflects some of these differences between squared loss and absolute loss.

Let’s find that new optimal constant prediction, w^*, by revisiting the three-step modeling recipe.

  1. Choose a model.

    We’ll stick with the constant model, h(x_i) = w.

  2. Choose a loss function.

    We’ll use absolute loss:

    L_\text{abs}(y_i, h(x_i)) = |y_i - h(x_i)|

    For the constant model, since h(x_i) = w, we have:

    L_\text{abs}(y_i, w) = |y_i - w|

  3. Minimize average loss to find optimal model parameters.

    The average loss – also known as mean absolute error here – is:

    R_\text{abs}(w) = \frac{1}{n} \sum_{i=1}^n |y_i - w|

In Chapter 1.2, we minimized \displaystyle R_\text{sq}(w) = \frac{1}{n} \sum_{i=1}^n (y_i - w)^2 by taking the derivative of R_\text{sq}(w) with respect to w and setting it equal to 0. That will be more challenging in the case of R_\text{abs}(w), because the absolute value function is not differentiable when its input is 0, as we just discussed.
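
Before tackling that minimization, note that evaluating R_\text{abs}(w) for any candidate w is straightforward. Here’s a minimal sketch using the five values from Chapter 1.2, assuming NumPy; the function name is my own, not from the course code.

```python
import numpy as np

y = np.array([72, 90, 61, 85, 92])

# Illustrative helper (not from the course code).
def mean_absolute_error(y, w):
    """R_abs(w): the average absolute loss of the constant prediction w on the dataset y."""
    return np.mean(np.abs(y - w))

print(mean_absolute_error(y, w=80))   # 10.8
print(mean_absolute_error(y, w=100))  # 20.0
```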


Mean Absolute Error for the Constant Model

We need to minimize the mean absolute error, R_\text{abs}(w), for the constant model, h(x_i) = w, but we have to address the fact that R_\text{abs}(w) is not differentiable across its entire domain.

R_\text{abs}(w) = \frac{1}{n} \sum_{i=1}^n |y_i - w|

Graphing Mean Absolute Error

I think it’ll help to visualize what R_\text{abs}(w) looks like. To do so, let’s reintroduce the small dataset of 5 values we used in Chapter 1.2.

y_1=72, \quad y_2=90, \quad y_3=61, \quad y_4=85, \quad y_5=92

Then, R_\text{abs}(w) is:

R_\text{abs}(w) = \frac{1}{5} (|72 - w| + |90 - w| + |61 - w| + |85 - w| + |92 - w|)
Image produced in Jupyter

This is a piecewise linear function. Where are the “bends” in the graph? Precisely where the data points, y_1, y_2, \ldots, y_5, are! It’s at exactly these points where R_\text{abs}(w) is not differentiable. At each of those points, the slope of the line segment approaching from the left is different from the slope of the line segment approaching from the right, and for a function to be differentiable at a point, the slope of the tangent line must be the same when approaching from the left and the right.

The graph of R_\text{abs}(w) above, while not differentiable at any of the data points, still shows us something about the optimal constant prediction. If there is a bend at each data point, and at each bend the slope increases – that is, becomes more positive – then the optimal constant prediction seems to be in the middle, when the slope goes from negative to positive. I’ll make this more precise in a moment.

For now, you might notice the value of w that minimizes the graph of R_\text{abs}(w) above is a familiar summary statistic, but not the mean. I won’t spell it out just yet, since I’d like for you to reason about it yourself.
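
Incidentally, a plot like the one above can be generated with just a few lines of code. Here’s a minimal sketch, assuming NumPy and matplotlib; this is my own approximation, not the course’s actual plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([72, 90, 61, 85, 92])

# Evaluate R_abs(w) on a fine grid of candidate constant predictions w.
ws = np.linspace(50, 100, 501)
risks = [np.mean(np.abs(y - w)) for w in ws]

plt.plot(ws, risks)
plt.xlabel("$w$")
plt.ylabel(r"$R_{\mathrm{abs}}(w)$")
plt.title("Mean absolute error of the constant prediction $w$")
plt.show()
```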

Let me show you one more graph of R_\text{abs}(w), but this time, in a case where there are an even number of data points. Suppose we have a sixth point, y_6 = 78.

y_1=72, \quad y_2=90, \quad y_3=61, \quad y_4=85, \quad y_5=92, \quad y_6=78

Then, R_\text{abs}(w) is:

R_\text{abs}(w) = \frac{1}{6} (|72 - w| + |90 - w| + |61 - w| + |85 - w| + |92 - w| + |78 - w|)

And its graph is:

Image produced in Jupyter

This graph is broken into 7 segments, with 6 bends (one per data point). Between the 3rd and 4th bends – that is, between the 3rd and 4th data points, when sorted – the slope is 0, and all values in that interval minimize R_\text{abs}(w). So, it seems that the value of w^* doesn’t have to be unique!
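
To confirm that it really is a whole interval of minimizers, here’s a quick numerical check (a sketch of mine, assuming NumPy):

```python
import numpy as np

y = np.array([72, 90, 61, 85, 92, 78])

# Illustrative helper (not from the course code).
def mean_absolute_error(y, w):
    return np.mean(np.abs(y - w))

# Sorted, the middle two data points are 78 and 85. Every w between them
# gives the same mean absolute error; a w outside that interval does worse.
for w in [78, 80, 82.5, 85, 86]:
    print(w, mean_absolute_error(y, w))
```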

Minimizing Mean Absolute Error

From the two graphs above, you may have a clear picture of what the optimal constant prediction, w^*, is. But, to avoid relying too heavily on visual intuition and just a single set of example data points, let’s try and minimize R_\text{abs}(w) mathematically, for an arbitrary set of data points.

To be clear, the goal is to minimize:

R_\text{abs}(w) = \frac{1}{n} \sum_{i=1}^n |y_i - w|

To do so, we’ll take the derivative of R_\text{abs}(w) with respect to w and set it equal to 0.

\frac{\text{d}}{\text{d}w} R_\text{abs}(w) = \frac{\text{d}}{\text{d}w} \left( \frac{1}{n} \sum_{i=1}^n |y_i - w| \right)

Using the familiar facts that the derivative of a sum is the sum of the derivatives, and that constants can be pulled out of the derivative, we have:

\frac{\text{d}}{\text{d}w} R_\text{abs}(w) = \frac{1}{n} \sum_{i=1}^n \frac{\text{d}}{\text{d}w} |y_i - w|

Here’s where the challenge comes in. What is \frac{\text{d}}{\text{d}w} |y_i - w|?

Let’s start by remembering the derivative of the absolute value function. The absolute value function itself can be thought of as a piecewise function:

|x| = \begin{cases} x & x \geq 0 \\ -x & x < 0 \end{cases}

Note that the x = 0 case can be lumped into either the x or -x case, since 0 and -0 are both 0.

Using this logic, I’ll write |y_i - w| as a piecewise function of w:

|y_i - w| = \begin{cases} y_i - w & w \leq y_i \\ w - y_i & w > y_i \end{cases}

I have written the two conditions with w on the left, since it’s easier for me to think in terms of w, but this means that the inequalities are flipped relative to how I presented them in the definition of |x|. Remember, |y_i - w| is a function of w; we’re treating y_i as some constant. If it helps, replace every instance of y_i with a concrete number, like 5, then reason through the resulting graph.
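
For instance, with y_i = 5, the piecewise definition becomes:

|5 - w| = \begin{cases} 5 - w & w \leq 5 \\ w - 5 & w > 5 \end{cases}

which is a “V” with its tip at w = 5: a line with slope -1 to the left of 5, and a line with slope 1 to the right of 5.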

Image produced in Jupyter

Now we can take the derivative of each piece:

\frac{\text{d}}{\text{d}w} |y_i - w| = \begin{cases} -1 & w < y_i \\ \text{undefined} & w = y_i \\ 1 & w > y_i \end{cases}

Great. Remember, this is the derivative of the absolute loss for a single data point. But our main objective is to find the derivative of the average absolute loss, R_\text{abs}(w). Using this piecewise definition of \frac{\text{d}}{\text{d}w} |y_i - w|, we have:

\begin{align*} \frac{\text{d}}{\text{d}w} R_\text{abs}(w) &= \frac{1}{n} \sum_{i=1}^n \frac{\text{d}}{\text{d}w} |y_i - w| \\ &= \frac{1}{n} \sum_{i=1}^n \begin{cases} -1 & w < y_i \\ \text{undefined} & w = y_i \\ 1 & w > y_i \end{cases} \end{align*}

At any point where w = y_i, for any value of i, \frac{\text{d}}{\text{d}w} R_\text{abs}(w) is undefined. (This makes any point where w = y_i a critical point.) Let’s exclude those values of w from our consideration. In all other cases, the sum in the expression above involves only two possible values: -1 and 1.

  • The sum adds -1 for all data points greater than w, i.e. where w < y_i.

  • The sum adds 1 for all data points less than w, i.e. where w > y_i.

Using some creative notation, I’ll re-write \frac{\text{d}}{\text{d}w} R_\text{abs}(w) as:

\frac{\text{d}}{\text{d}w} R_\text{abs}(w) = \frac{1}{n} \left( \sum_{w < y_i} -1 + \sum_{w > y_i} 1 \right)

The sum \displaystyle \sum_{w < y_i} -1 is the sum of -1 for all data points greater than w, so perhaps a more intuitive way to write it is:

\sum_{w < y_i} -1 = \underbrace{(-1) + (-1) + \ldots + (-1)}_{\text{add once per data point to the right of } w} = -(\text{\# right of } w)

Equivalently, \displaystyle \sum_{w > y_i} 1 = (\text{\# left of } w), meaning that:

\begin{align*} \frac{\text{d}}{\text{d}w} R_\text{abs}(w) &= \frac{1}{n} \left( -(\text{\# right of } w) + (\text{\# left of } w) \right) \\ &= \boxed{\frac{\text{\# left of } w - \text{\# right of } w}{n}} \end{align*}

By “left of w”, I mean less than w.

This boxed form gives us the slope of R_\text{abs}(w), for any point w that is not an original data point. To put it in perspective, let’s revisit the first graph we saw in this section, where we plotted R_\text{abs}(w) for the dataset:

y_1=72, \quad y_2=90, \quad y_3=61, \quad y_4=85, \quad y_5=92

R_\text{abs}(w) = \frac{1}{5} (|72 - w| + |90 - w| + |61 - w| + |85 - w| + |92 - w|)
Image produced in Jupyter
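
As a quick sanity check of the boxed formula, here’s a sketch (assuming NumPy; the helper name is my own, not from the course code) that computes the slope of R_\text{abs}(w) for a few values of w that aren’t data points:

```python
import numpy as np

y = np.array([72, 90, 61, 85, 92])

# Illustrative helper (not from the course code).
def slope_of_mae(y, w):
    """Slope of R_abs at w, valid only when w is not one of the data points:
    (# of data points left of w minus # of data points right of w) / n."""
    assert w not in y, "the slope is undefined at a data point"
    return (np.sum(y < w) - np.sum(y > w)) / len(y)

print(slope_of_mae(y, 70))  # -0.6: 1 point to the left of 70, 4 to the right
print(slope_of_mae(y, 80))  # -0.2: 2 to the left, 3 to the right
print(slope_of_mae(y, 88))  #  0.2: 3 to the left, 2 to the right
```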

Now that we have a formula for \frac{\text{d}}{\text{d}w} R_\text{abs}(w), the easy thing to claim is that we could set it to 0 and solve for w. Doing so would give us:

\frac{\text{\# left of } w - \text{\# right of } w}{n} = 0

Which yields the condition:

\text{\# left of } w = \text{\# right of } w

The optimal value of w is the one that satisfies this condition, and that’s precisely the median of the data, as you may have noticed earlier.

This logic isn’t fully rigorous, however, because the formula for \frac{\text{d}}{\text{d}w} R_\text{abs}(w) is only valid for w’s that aren’t original data points, and if we have an odd number of data points, the median is indeed one of the original data points. In the graph above, there is never a point where the slope is 0.

To fully justify why the median minimizes mean absolute error even when there are an odd number of data points, I’ll say that:

  • If w is just to the left of the median, there are more points to the right of w than to the left of w, so (\text{\# left of } w) < (\text{\# right of } w) and \frac{(\text{\# left of } w) - (\text{\# right of } w)}{n} is negative.

  • If w is just to the right of the median, there are more points to the left of w than to the right of w, so (\text{\# left of } w) > (\text{\# right of } w) and \frac{(\text{\# left of } w) - (\text{\# right of } w)}{n} is positive.

So even though the slope is undefined at the median, we know it is a point at which the sign of the derivative switches from negative to positive, and as we discussed in Appendix 2, this sign change implies at least a local minimum.

To summarize:

  • If n is odd, the median minimizes mean absolute error.

  • If n is even, any value between the middle two values (when sorted) minimizes mean absolute error. (It’s common to call the mean of the middle two values the median.) Both cases are verified numerically in the sketch below.
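
Here’s a minimal sketch of that check (assuming NumPy): it compares the mean absolute error of np.median against the best value found by a brute-force search over a grid of candidate constant predictions.

```python
import numpy as np

# Illustrative helper (not from the course code).
def mean_absolute_error(y, w):
    return np.mean(np.abs(y - w))

for y in [np.array([72, 90, 61, 85, 92]),       # odd n: the median is 85
          np.array([72, 90, 61, 85, 92, 78])]:  # even n: np.median returns (78 + 85) / 2 = 81.5
    med = np.median(y)
    ws = np.linspace(y.min(), y.max(), 1001)    # grid of candidate constant predictions
    grid_best = min(mean_absolute_error(y, w) for w in ws)
    # The median's error is at least as small as the best error found on the grid.
    print(med, mean_absolute_error(y, med), grid_best)
```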

We’ve just made a second pass through the three-step modeling recipe:

  1. Choose a model.

    h(x_i) = w

  2. Choose a loss function.

    L_\text{abs}(y_i, w) = |y_i - w|

  3. Minimize average loss to find optimal model parameters.

    R_\text{abs}(w) = \frac{1}{n} \sum_{i=1}^n |y_i - w| \implies w^* = \text{Median}(y_1, y_2, \ldots, y_n)

Conclusion

What we’ve now discovered is that the optimal model parameter (in this case, the optimal constant prediction) depends on the loss function we choose!

In the context of the commute times dataset from Chapter 1.2, our two optimal constant predictions can be visualized as flat lines, as shown below.

Image produced in Jupyter

Depending on your criteria for what makes a good or bad prediction (i.e., the loss function you choose), optimal model parameters may change.


Next, we’ll compare absolute loss to squared loss and see how different loss choices change the optimal constant model.