As we saw in Chapter 8.3, gradient descent is prone to getting stuck in local minima.
When I use gradient descent to minimize an empirical risk function, getting stuck at a local minimum that is not global means landing on a parameter vector that is not actually optimal. That is a real problem.
In modern machine learning, the loss surfaces people care about are often huge and messy, and dealing with bad local minima is part of the game. I am not going to get into those fixes here. Instead, I want to focus on a special family of functions where life is much nicer: convex functions.
The following video, recorded in an earlier semester, summarizes the key ideas of this section.
Formal Definition of Convexity¶
Let me start with the picture, because the picture is the whole point.
If you need a refresher on what a secant line is, take a quick look at the appendix. For a scalar-to-scalar function, I want every secant line between two points on the graph to lie on or above the graph itself. If that happens no matter which two points I pick, the function has the familiar bowl-shaped behavior I want.
The figure below is the geometry I have in mind.
That picture turns almost directly into algebra. Suppose I pick two inputs $a$ and $b$, and some $t \in [0, 1]$.

The input that is a fraction $t$ of the way from $a$ to $b$ is $(1 - t)a + tb$.

The height of the secant line at that same horizontal location is $(1 - t)f(a) + tf(b)$.

So saying “the secant line lies above the graph” is the same as saying

$$f\big((1 - t)a + tb\big) \leq (1 - t)f(a) + tf(b)$$
That is the formal definition in one dimension. In higher dimensions, nothing really changes: I just replace the scalar inputs $a$ and $b$ with vectors $\vec{a}$ and $\vec{b}$, and think about the line segment connecting them. A function $f: \mathbb{R}^d \to \mathbb{R}$ is convex if, for all $\vec{a}$ and $\vec{b}$ in its domain and all $t \in [0, 1]$,

$$f\big((1 - t)\vec{a} + t\vec{b}\big) \leq (1 - t)f(\vec{a}) + tf(\vec{b})$$

For $d = 1$, this is exactly the secant-line picture above. For $d > 1$, it says the same thing along every line segment in the domain.
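The definition is easy to test numerically. A minimal sketch, using $f(x) = x^2$ as a stand-in convex function and random choices of $a$, $b$, and $t$ (the function and sampling are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    return x ** 2  # a convex function

# Check f((1-t)a + t b) <= (1-t) f(a) + t f(b) at many random points.
a = rng.uniform(-10, 10, size=1000)
b = rng.uniform(-10, 10, size=1000)
t = rng.uniform(0, 1, size=1000)

lhs = f((1 - t) * a + t * b)        # height of the graph
rhs = (1 - t) * f(a) + t * f(b)     # height of the secant line
print(np.all(lhs <= rhs + 1e-12))   # True: every secant lies on or above the graph
```

Every sampled secant point satisfies the inequality, which is exactly the bowl-shaped behavior the definition encodes.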
This is one of those definitions that is much more useful than it first looks. I do not just want a fancy way to say “bowl-shaped.” I want an inequality that I can actually plug into proofs.
Once I know that every point on every line segment satisfies a weighted-average inequality, I can stop arguing from a picture and start choosing $\vec{a}$, $\vec{b}$, and $t$ strategically. That is what makes the next result work.
Local Minimums are Global Minimums¶
This is the payoff.
Suppose $f$ is convex, and suppose $\vec{x}^*$ is a local minimum of $f$. I want to show that $\vec{x}^*$ is automatically a global minimum.

Take any other point $\vec{y}$. Since $\vec{x}^*$ is a local minimum, points on the line segment from $\vec{x}^*$ toward $\vec{y}$ that stay close enough to $\vec{x}^*$ cannot have smaller function value. So for all sufficiently small $t > 0$,

$$f\big((1 - t)\vec{x}^* + t\vec{y}\big) \geq f(\vec{x}^*)$$

But convexity also gives

$$f\big((1 - t)\vec{x}^* + t\vec{y}\big) \leq (1 - t)f(\vec{x}^*) + tf(\vec{y})$$

Putting those together,

$$f(\vec{x}^*) \leq (1 - t)f(\vec{x}^*) + tf(\vec{y})$$

Subtract $(1 - t)f(\vec{x}^*)$ from both sides:

$$tf(\vec{x}^*) \leq tf(\vec{y})$$

Since $t > 0$, I can divide by $t$ and get

$$f(\vec{x}^*) \leq f(\vec{y})$$

And since $\vec{y}$ was arbitrary, $\vec{x}^*$ is at least as good as every other point in the domain. So $\vec{x}^*$ is a global minimum.
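The chain of inequalities is easy to replay numerically. A minimal sketch, using the convex function $f(x) = (x - 2)^2$ with local minimum at $x^* = 2$ (a toy example of my own):

```python
import numpy as np

def f(x):
    return (x - 2.0) ** 2    # convex, with a local minimum at x* = 2

x_star = 2.0
rng = np.random.default_rng(0)

for y in rng.uniform(-50, 50, size=100):
    for t in (0.5, 0.1, 0.01):
        # Convexity: the graph lies on or below the chord from x* to y.
        lhs = f((1 - t) * x_star + t * y)
        rhs = (1 - t) * f(x_star) + t * f(y)
        assert lhs <= rhs + 1e-9
    # The proof's conclusion: x* is at least as good as any other point.
    assert f(x_star) <= f(y)

print("local minimum is global at every sampled point")
```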
This is why convexity matters so much in optimization. It does not magically make minimization easy, but it does remove the possibility of bad local minima.
There is one important caveat, though: convex functions do not need to have global minima in the first place. A standard example is $f(x) = e^x$. It is convex, and it keeps getting smaller as I move left, but it never actually attains its infimum of 0.
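A quick numerical illustration of that caveat:

```python
import numpy as np

# e^x is convex and keeps shrinking as x moves left,
# but it never reaches its infimum of 0.
xs = np.array([-1.0, -10.0, -100.0, -700.0])
values = np.exp(xs)
print(values)               # strictly decreasing toward 0...
print(np.all(values > 0))   # ...yet always positive: no minimizer exists
```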
Strict Convexity¶
Convexity rules out bad local minima, but it does not rule out ties.
A convex function can have a completely flat bottom, in which case every point in that flat region is a local minimum and a global minimum. The example below does exactly that.
So if I want a guarantee of a single best point, I need to strengthen the definition a little.
A function $f$ is strictly convex if, for all $\vec{a} \neq \vec{b}$ in its domain and all $t \in (0, 1)$,

$$f\big((1 - t)\vec{a} + t\vec{b}\big) < (1 - t)f(\vec{a}) + tf(\vec{b})$$

The only difference is that the inequality is now strict once I move away from the endpoints. Geometrically, the secant line is allowed to touch the graph at the two endpoints, but not in between.
Now I can prove the uniqueness statement I really want. Suppose a strictly convex function $f$ has a global minimum, but that there are two different minimizers, $\vec{x}_1$ and $\vec{x}_2$. Let their common minimum value be $m = f(\vec{x}_1) = f(\vec{x}_2)$.

For any $t \in (0, 1)$, strict convexity says

$$f\big((1 - t)\vec{x}_1 + t\vec{x}_2\big) < (1 - t)f(\vec{x}_1) + tf(\vec{x}_2) = (1 - t)m + tm = m$$

But that says there is a point with function value smaller than the global minimum value $m$, which is impossible. So a strictly convex function can have at most one global minimum. In other words: if a global minimum exists, it is unique.
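The difference between the two definitions is visible numerically. A minimal sketch comparing the strictly convex $x^2$ with $|x|$, which is convex but not strictly convex (both functions are my own illustrations):

```python
import numpy as np

def chord_gap(f, a, b, t):
    """(secant height) - (graph height) at the point a fraction t from a to b."""
    return (1 - t) * f(a) + t * f(b) - f((1 - t) * a + t * b)

square = lambda x: x ** 2   # strictly convex
absval = lambda x: abs(x)   # convex, but not strictly convex

# Strict convexity: the gap is positive strictly between distinct endpoints.
print(chord_gap(square, 1.0, 3.0, 0.5) > 0)   # True
# |x| is linear on [1, 3], so the secant touches the graph in between.
print(chord_gap(absval, 1.0, 3.0, 0.5) == 0)  # True
```

The zero gap for $|x|$ is exactly the “tie” behavior that strict convexity rules out.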
Strict Convexity and Mean Squared Error¶
This is exactly the distinction I want you to keep in mind for mean squared error.
Recall that

$$R(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X\vec{w} \rVert^2$$

When the columns of $X$ are linearly independent, this risk surface curves upward in every direction, so there is a single best parameter vector. When the columns of $X$ are linearly dependent, there are flat directions: I can move in some directions without changing the predictions, and the minimum need not be unique.
The contour plots below show both behaviors.
The left-hand plot is the nice case: one bowl, one bottom, one minimizer. The right-hand plot still describes a convex function, but not a strictly convex one, because there is a whole line of minimizers.
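To see a flat direction concretely, here is a small numpy experiment (the tiny dataset is my own, not the one plotted): when one column of $X$ is a multiple of another, different parameter vectors give identical predictions and identical risk.

```python
import numpy as np

# Toy design matrix whose second column is twice the first (linearly dependent).
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
y = np.array([1.0, 2.0, 2.0])

def mse(w):
    return np.mean((y - X @ w) ** 2)

# Two different parameter vectors that differ by a multiple of (2, -1),
# a direction satisfying X @ (2, -1) = 0.
w1 = np.array([1.0, 0.0])
w2 = np.array([0.0, 0.5])

print(np.allclose(X @ w1, X @ w2))   # True: identical predictions
print(np.isclose(mse(w1), mse(w2)))  # True: identical risk
```

Moving along such a null-space direction slides you along the flat valley of the risk surface without changing the loss at all.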
I am not proving the full criterion for strict convexity here just yet. The clean explanation comes from the Hessian, and one chapter from now we will phrase that explanation using eigenvalues of $X^TX$. But the geometry is already visible: full rank gives curvature in every direction; linear dependence creates flat directions.
Second Derivative Test¶
The formal definition is the ground truth, but it is not always the fastest way to check that a function is convex.
For scalar-to-scalar functions, you may already know the second derivative test from calculus: if a twice-differentiable function $f$ satisfies

$$f''(x) > 0$$

for all $x$ in its domain, then $f$ is convex. In fact, the slightly weaker condition $f''(x) \geq 0$ is enough for convexity, while $f''(x) > 0$ points toward strict convexity. (It is not quite a characterization of strict convexity: $f(x) = x^4$ is strictly convex even though $f''(0) = 0$.)
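A finite-difference version of this test is easy to sketch (the step size and test functions are my own choices):

```python
import numpy as np

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

xs = np.linspace(-3, 3, 101)

# exp is convex: the estimated second derivative is positive everywhere.
print(np.all(second_derivative(np.exp, xs) > 0))             # True
# x^4 is convex too: f''(x) = 12 x^2 >= 0, vanishing only at x = 0.
print(np.all(second_derivative(lambda x: x ** 4, xs) >= 0))  # True
```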
What does this look like for vector-to-scalar functions? Now there is no single second derivative. There are many of them.
For example, if

$$f(\vec{x}) = x_1^2 + 2x_1x_2 + 3x_2^2$$

then the first partial derivatives are

$$\frac{\partial f}{\partial x_1} = 2x_1 + 2x_2, \qquad \frac{\partial f}{\partial x_2} = 2x_1 + 6x_2$$

and the second partial derivatives are

$$\frac{\partial^2 f}{\partial x_1^2} = 2, \qquad \frac{\partial^2 f}{\partial x_1 \partial x_2} = \frac{\partial^2 f}{\partial x_2 \partial x_1} = 2, \qquad \frac{\partial^2 f}{\partial x_2^2} = 6$$

The natural thing to do is collect all of those second partial derivatives into a matrix, called the Hessian of $f$.

For the example above,

$$\nabla^2 f(\vec{x}) = \begin{bmatrix} 2 & 2 \\ 2 & 6 \end{bmatrix}$$

which does not even depend on $\vec{x}$.
The vector-valued second derivative test says: a twice-differentiable function $f$ is convex exactly when its Hessian is positive semidefinite everywhere, meaning

$$\vec{z}^T \nabla^2 f(\vec{x}) \, \vec{z} \geq 0 \quad \text{for all } \vec{z} \text{ and all } \vec{x} \text{ in the domain}$$

If the Hessian is positive definite everywhere, then I get strict convexity.
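In practice, a common way to check positive semidefiniteness is through eigenvalues: a symmetric matrix is positive semidefinite exactly when all its eigenvalues are nonnegative. A minimal sketch, with a small symmetric matrix of my own standing in for a constant Hessian:

```python
import numpy as np

# A constant Hessian (my own example matrix, symmetric).
H = np.array([[2.0, 2.0],
              [2.0, 6.0]])

# Symmetric matrix is PSD iff every eigenvalue is >= 0.
eigenvalues = np.linalg.eigvalsh(H)
print(np.all(eigenvalues >= 0))   # True: this Hessian is PSD (in fact PD)

# Equivalent direct check at random vectors: z^T H z >= 0.
rng = np.random.default_rng(1)
z = rng.standard_normal((1000, 2))
print(np.all(np.einsum('ij,jk,ik->i', z, H, z) >= 0))  # True
```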
This brings us back to mean squared error. For

$$R(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X\vec{w} \rVert^2$$

we already computed

$$\nabla R(\vec{w}) = -\frac{2}{n} X^T (\vec{y} - X\vec{w})$$

Differentiate once more, and the Hessian is

$$\nabla^2 R(\vec{w}) = \frac{2}{n} X^T X$$

That is a really nice conclusion: the Hessian is constant, and it is always positive semidefinite, since $\vec{z}^T X^TX \vec{z} = \lVert X\vec{z} \rVert^2 \geq 0$ for every $\vec{z}$. So mean squared error is always convex. If the columns of $X$ are linearly independent, then $X^TX$ is positive definite, which is why $R$ becomes strictly convex and has a unique minimizer. If the columns are dependent, there is a zero-curvature direction, which is exactly the flat valley we saw above.
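A sketch of this dichotomy with toy design matrices (both matrices are my own examples):

```python
import numpy as np

def mse_hessian(X):
    """Hessian of mean squared error: (2/n) X^T X."""
    n = X.shape[0]
    return (2 / n) * X.T @ X

# Independent columns: all eigenvalues positive -> strictly convex risk.
X_indep = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])
print(np.linalg.eigvalsh(mse_hessian(X_indep)))  # all positive

# Dependent columns (second = 2 * first): a zero eigenvalue -> a flat direction.
X_dep = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0]])
eigs = np.linalg.eigvalsh(mse_hessian(X_dep))
print(np.isclose(eigs.min(), 0.0))               # True: zero-curvature direction
```

The eigenvector attached to the zero eigenvalue points straight down the flat valley of the risk surface.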
Aside: Tangent Hyperplanes¶
For scalar-to-scalar functions, there is another way to recognize convexity: a differentiable convex function always lies above each of its tangent lines.
For vector-to-scalar functions, the same story holds, except the tangent line becomes a tangent hyperplane. When the input has two coordinates, that hyperplane is literally just a plane in 3D.
The next figure shows a convex surface together with its tangent plane at one point.
Suppose I fix a point $\vec{a}$. The tangent hyperplane to $f$ at $\vec{a}$ is the linear approximation

$$T(\vec{x}) = f(\vec{a}) + \nabla f(\vec{a})^T (\vec{x} - \vec{a})$$

This is the vector-valued version of the tangent-line formula from calculus. For a differentiable convex function $f$,

$$f(\vec{x}) \geq f(\vec{a}) + \nabla f(\vec{a})^T (\vec{x} - \vec{a}) \quad \text{for all } \vec{x} \text{ and } \vec{a}$$

In words: the graph of a differentiable convex function lies above every tangent hyperplane.
I like this characterization because it tells me that the local linear information in the gradient is globally trustworthy. On a non-convex function, a tangent line or tangent plane can point you in a misleading direction. On a convex function, the tangent hyperplane never overshoots the graph, which is a big part of why gradient-based optimization behaves so nicely in this setting.
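This global guarantee is also easy to probe numerically. A minimal sketch, using a convex quadratic of my own choosing and checking that its graph never dips below the tangent plane at a fixed point:

```python
import numpy as np

# A convex quadratic (my own example) and its gradient.
def f(x):
    return x[0] ** 2 + x[1] ** 2 + x[0] * x[1]

def grad_f(x):
    return np.array([2 * x[0] + x[1], 2 * x[1] + x[0]])

a = np.array([1.0, -2.0])   # point of tangency

def tangent(x):
    """Height of the tangent plane at a, evaluated at x."""
    return f(a) + grad_f(a) @ (x - a)

rng = np.random.default_rng(7)
points = rng.uniform(-10, 10, size=(1000, 2))
gaps = np.array([f(x) - tangent(x) for x in points])
print(np.all(gaps >= -1e-9))   # True: the graph stays on or above the tangent plane
```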