The title of this post is the name of a recently uploaded arXiv preprint by Lee et al. This work is by the same researchers who wrote Gradient Descent Converges to Minimizers (in COLT 2016), together with some folks who followed up on that work by answering open questions about step-size upper bounds and non-isolated critical points. The original paper focused on the convergence of vanilla gradient descent, while this new submission significantly expands that work to prove convergence properties of the proximal point algorithm, (block) coordinate descent, manifold gradient descent, and mirror descent.

There’s been a lot of recent work on understanding why first-order descent methods work so well in so many non-convex settings. A lot of this has been inspired by stochastic gradient descent’s huge success in deep learning, where the loss surfaces are highly non-convex. Some big-shot questions include:

- How good are local minima on these loss surfaces?
- How prevalent are saddle points on these loss surfaces?
- Are second-order methods very useful in this setting? Are first-order methods good enough?
- What kind of critical points does gradient descent (or some other first-order method) find?

This paper is mainly looking at the last point. In particular, it’s asking about the likelihood that first-order methods converge to saddle points. As my first blog post, I chose this paper because I found its proof techniques elegant and intuitive. The main contribution is clearly marked out in the very first line of the abstract:

We establish that first-order methods avoid saddle points for almost all initializations.

Very cool! Let’s dive into what this means. For simplicity’s sake, I’m going to outline their work on gradient descent, and then towards the end, I’ll talk about what conditions they prove for the other first-order methods.

We’ll be working in the domain $\mathcal{X} = \mathbb{R}^d$, with a twice-continuously differentiable objective function $f : \mathcal{X} \to \mathbb{R}$. We denote the gradient as $\nabla f$ and the Hessian as $\nabla^2 f$. Our goal is to find a minimizer of $f$. Recall that a point $x^*$ is called a critical point of $f$ if $\nabla f(x^*) = 0$.

Let’s recall some basic properties about the eigenvalues of the Hessian at a critical point. When they’re all positive, we can imagine the surface as locally “bowl-shaped”, and the point is a local minimum. Likewise, when they’re all negative, the surface curves downward and we are at a local maximum. If there is a mix of positive and negative eigenvalues (all non-zero), we say we’re at a *nondegenerate* saddle point, e.g. the origin for $f(x, y) = x^2 - y^2$. If some eigenvalues are zero, the critical point is *degenerate* and the Hessian alone doesn’t tell us its type; this indicates a more complex loss surface, such as the inflection point of $x^3$ in the 1D case or a monkey saddle.
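The eigenvalue cases above can be written down as a tiny classifier. This is only a sketch: the function name and the `tol` threshold for treating an eigenvalue as zero are my own illustrative choices, and it takes the Hessian eigenvalues as given rather than computing them.

```python
def classify_critical_point(hessian_eigenvalues, tol=1e-12):
    """Classify a critical point from the eigenvalues of its Hessian.

    `tol` is an illustrative threshold below which an eigenvalue counts as zero.
    """
    has_zero = any(abs(lam) <= tol for lam in hessian_eigenvalues)
    has_pos = any(lam > tol for lam in hessian_eigenvalues)
    has_neg = any(lam < -tol for lam in hessian_eigenvalues)
    if has_zero:
        # The second-order test is inconclusive here.
        return "degenerate"
    if has_pos and has_neg:
        return "nondegenerate saddle"
    return "local minimum" if has_pos else "local maximum"


# Hessian eigenvalues at the origin for a few simple functions.
print(classify_critical_point([2.0, 2.0]))    # x^2 + y^2
print(classify_critical_point([2.0, -2.0]))   # x^2 - y^2
print(classify_critical_point([0.0, 0.0]))    # monkey saddle x^3 - 3xy^2
```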

Next, we have an important definition, which determines the types of critical points we’ll be working to avoid in our minimization algorithm:

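Definition 1: A point $x^*$ is a strict saddle point of $f$ if $x^*$ is a critical point and $\lambda_{\min}(\nabla^2 f(x^*)) < 0$ (where $\lambda_{\min}$ denotes the smallest eigenvalue of $\nabla^2 f(x^*)$). Let $\mathcal{X}^*$ denote the set of strict saddle points.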

This definition actually includes local maxima. We’re dealing with a minimization algorithm, though, and our work will show that such points are unstable as well. These strict saddle points contrast with degenerate saddle points where the Hessian is positive semidefinite.

It would suck to converge to a saddle point, mainly because it’s not a minimum. In fact, they often have really bad objective values (studied here for example). Intuitively, it feels unlikely that gradient descent would converge to something in $\mathcal{X}^*$ given a random initialization. Imagine a hill with a narrow plateau in the middle. We’re about to roll a ball down the hill, and we’re choosing a starting point. Sure, I can find some spots where the ball would end up at the plateau, but more often than not the ball would make its way down. This is a bit oversimplified, but the authors provide a more formal example of this phenomenon in section 3.

Generically, we’re interested in an optimization algorithm $g : \mathcal{X} \to \mathcal{X}$. The iterates of this algorithm are given by the sequence $x_k = g^k(x_0)$, where $g^k$ is the $k$-fold composition of $g$. In the case of gradient descent with step size $\alpha$, this is given by $g(x) = x - \alpha \nabla f(x)$. We call $x^*$ a *fixed point* of $g$ if $g(x^*) = x^*$. For the gradient descent case, $x^*$ is a fixed point of $g$ if and only if $x^*$ is a critical point of $f$.
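Here is a minimal sketch of gradient descent viewed as a map that we iterate. The objective $f(x, y) = x^2/2 + y^4/4 - y^2/2$, the step size, and all function names are illustrative assumptions of mine. This $f$ has a strict saddle at $(0, 0)$ and minima at $(0, \pm 1)$, and all three are fixed points of the map.

```python
def grad_f(p):
    """Gradient of the illustrative objective f(x, y) = x**2/2 + y**4/4 - y**2/2."""
    x, y = p
    return (x, y**3 - y)


def make_gd_map(grad, alpha):
    """Return the gradient descent map g(p) = p - alpha * grad(p)."""
    def g(p):
        return tuple(pi - alpha * gi for pi, gi in zip(p, grad(p)))
    return g


def iterate(g, p0, k):
    """Return g^k(p0), the k-fold composition of g applied to p0."""
    p = p0
    for _ in range(k):
        p = g(p)
    return p


g = make_gd_map(grad_f, alpha=0.25)
print(g((0.0, 0.0)))                 # the saddle is a fixed point
print(g((0.0, 1.0)))                 # so is the minimum
print(iterate(g, (1.0, 0.5), 200))   # a generic start heads to a minimum
```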

Using this, we can define the set of initial points that are “bad”:

Definition 2 (Global Stable Set): The global stable set $W_g$ of strict saddle points is the set of initial conditions where the iterates converge to a strict saddle point, i.e. $W_g = \{x_0 : \lim_{k \to \infty} g^k(x_0) \in \mathcal{X}^*\}$.

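When do we avoid strict saddle points? When the probability of ending up at one is zero under a random initialization (drawn from any distribution that has a density). In this case, we say that the set $W_g$ has measure zero, i.e. $\mu(W_g) = 0$, where $\mu$ is the Lebesgue measure. We’ll use this notation for the rest of the article.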

Next, we need another definition. Call $Dg(x)$ the Jacobian of $g$ at $x$, and let $\lambda_i(Dg(x))$ be its $i$th eigenvalue. We’re going to look at the set of fixed points that are “unstable”. Intuitively, a fixed point $x^*$ is unstable if $g$ stretches distances along some direction in a neighborhood about $x^*$. If it does, then there is a direction along which the iterates get pushed out of the neighborhood very quickly. More formally:

Definition 3 (Unstable fixed points): Let $\mathcal{A}_g^*$ be the set of fixed points where the differential $Dg$ has at least a single eigenvalue with magnitude greater than one. These are the unstable fixed points. That is,

$$\mathcal{A}_g^* = \{x : g(x) = x,\ \max_i |\lambda_i(Dg(x))| > 1\}.$$

Let me emphasize that here, we’re talking about the eigenvalues of the Jacobian of $g$, not the Hessian of $f$. $Dg$ gives us information about how the iterates behave near a fixed point. In fact, it’s precisely the linearization of the discrete-time dynamical system whose dynamics are governed by $g$. That is, if I model the trajectory of a state $x$ over time by iteratively applying $g$ to $x$, the Jacobian provides a linear approximation of this trajectory in a neighborhood of $x^*$. So whenever I have an eigenvalue with magnitude greater than one, I have a direction along which small deviations from the fixed point grow with every step, hence the term “unstable.”
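To see the linearization at work, here is a small sketch comparing the true iterates with the linear approximation near a saddle. I’m assuming the illustrative objective $f(x, y) = x^2/2 + y^4/4 - y^2/2$ and step size $0.25$ (neither is prescribed by the paper). At the fixed point $(0, 0)$ the Jacobian of the gradient descent map is diagonal with entries $1 - \alpha$ and $1 + \alpha$.

```python
ALPHA = 0.25  # illustrative step size


def g(p):
    """One gradient descent step on f(x, y) = x**2/2 + y**4/4 - y**2/2."""
    x, y = p
    return (x - ALPHA * x, y - ALPHA * (y**3 - y))


# Eigenvalues of Dg(0, 0) = I - ALPHA * diag(1, -1).
DG_EIGENVALUES = (1 - ALPHA, 1 + ALPHA)


def exact(p0, k):
    """Apply the true map g to p0 k times."""
    p = p0
    for _ in range(k):
        p = g(p)
    return p


def linearized(p0, k):
    """Apply the linear map Dg(0, 0) to p0 k times."""
    return tuple(lam**k * c for lam, c in zip(DG_EIGENVALUES, p0))


p0 = (1e-6, 1e-6)  # a small deviation from the saddle
for k in (1, 10, 20):
    print(k, exact(p0, k), linearized(p0, k))
```

The first coordinate shrinks by a factor of $0.75$ per step while the second grows by a factor of roughly $1.25$, which is exactly the instability the definition captures.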

To talk about the measure of these sets, the authors use something called the Stable Manifold Theorem from dynamical systems theory. I’m going to give most of its formal statement, and then work out what it means for us practically.

First, let me define a *diffeomorphism*. A function $g$ is a diffeomorphism if it has an inverse and both $g$ and its inverse are smooth. We can also have something slightly weaker called a *local diffeomorphism*. $g$ is a local diffeomorphism if for each $x$, there is a neighborhood $U$ around $x$ such that $g$ restricted to $U$ is a diffeomorphism onto its image. That is, I can find a small area around $x$ on which I can define a smooth inverse, but I might not be able to find one inverse that works for the whole space. One example of a local diffeomorphism which isn’t a global one is the map $t \mapsto e^{it}$ from the real line to the unit circle, which “wraps around itself” infinitely many times.

Basically, given some iterate $x_k$, we might not be able to smoothly trace back to our starting position using $g^{-1}$. But if $g$ is a local diffeomorphism, we can retrace our steps in a small region near $x_k$. Now, we don’t know *a priori* that this is the case for an arbitrary $g$, but let’s see what happens when this is true:

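Theorem 1 (Stable Manifold Theorem): Let $x^*$ be a fixed point of a local diffeomorphism $g : \mathcal{X} \to \mathcal{X}$. Let $E_s$ be the span of the eigenvectors of $Dg(x^*)$ corresponding to eigenvalues of magnitude less than or equal to one. Then there is an embedded disk $W^{cs}_{loc}$ tangent to $E_s$ at $x^*$ called the local stable center manifold. Moreover, there is a neighborhood $B$ of $x^*$, such that $g(W^{cs}_{loc}) \cap B \subset W^{cs}_{loc}$, and $\bigcap_{k=0}^{\infty} g^{-k}(B) \subset W^{cs}_{loc}$.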

Let’s unpack this. We’re saying that there is a stable manifold $W^{cs}_{loc}$ around $x^*$. What do we mean by stable? Well, if we are in $W^{cs}_{loc}$, we apply $g$, and end up inside our neighborhood $B$, then we are still in $W^{cs}_{loc}$. And any point whose iterates stay inside $B$ forever must itself lie in $W^{cs}_{loc}$. So the only way to stay “trapped” around this fixed point is to be on the manifold. For our purposes, we know that if $x^*$ is a strict saddle point and $W^{cs}_{loc}$ is “small” enough, we shouldn’t get stuck around it. And it seems like if $x^*$ is an unstable fixed point, then this stable region should definitely be “small” enough. This is true since when $x^*$ is an unstable fixed point, $E_s$ has dimension less than $d$. This implies that $W^{cs}_{loc}$ is an embedded disk with dimension less than $d$, and hence has measure zero. Just imagine a flat disk in three-dimensional space. The volume of this infinitely thin disk is zero.
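To make the measure-zero claim concrete, here is a small worked example of my own (it is not taken from the paper), assuming the quadratic $f(x, y) = \tfrac{1}{2}(x^2 - y^2)$ and a step size $0 < \alpha < 1$. Gradient descent is then a linear map:

$$
g(x, y) = \big((1 - \alpha)\,x,\ (1 + \alpha)\,y\big), \qquad g^k(x_0, y_0) = \big((1 - \alpha)^k x_0,\ (1 + \alpha)^k y_0\big).
$$

Since $(1 + \alpha)^k \to \infty$, the iterates converge to the saddle at the origin if and only if $y_0 = 0$. The Jacobian is $\mathrm{diag}(1 - \alpha, 1 + \alpha)$ everywhere, so the span of eigenvectors with eigenvalue magnitude at most one is the $x$-axis, and the set of bad starting points $\{(x_0, 0)\}$ is a line in the plane, which has zero area.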

This seems very promising. But how do we verify that $g$ is a local diffeomorphism? Well, by the Inverse Function Theorem, we just need to check that the Jacobian of $g$ at any point is nonsingular, or equivalently that the determinant $\det(Dg(x)) \neq 0$ for all $x$.

This work leads to the following theorem:

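Theorem 2: Suppose $\det(Dg(x)) \neq 0$ for all $x \in \mathcal{X}$. Then the set of initial points that end up at unstable fixed points has measure zero, i.e. $\mu(\{x_0 : \lim_{k \to \infty} g^k(x_0) \in \mathcal{A}_g^*\}) = 0$.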

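*An abbreviated proof*: For every point $x^*$ in $\mathcal{A}_g^*$, I can find a neighborhood (call it $B_{x^*}$) using Theorem 1. Now look at some $x_0$ whose iterates converge to $x^*$. Then I can find some $T$, such that $g^t(x_0) \in B_{x^*}$ for all $t \geq T$. So $g^t(x_0) \in \bigcap_{k=0}^{\infty} g^{-k}(B_{x^*})$ for all $t \geq T$. By Theorem 1, this set is a subset of the local stable center manifold and has measure zero. Pulling a measure-zero set back through $g^{-T}$ keeps it measure zero because $g$ is a local diffeomorphism, and countably many of the neighborhoods $B_{x^*}$ are enough to cover $\mathcal{A}_g^*$. Using this, we can deduce that the set of bad initial points is contained in a countable union of sets of measure zero, and so it has measure zero as desired.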

This immediately leads to the corollary:

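Corollary 1: Suppose $\det(Dg(x)) \neq 0$ for all $x$, and every strict saddle point is an unstable fixed point, i.e. $\mathcal{X}^* \subset \mathcal{A}_g^*$. Then the global stable set of strict saddle points has measure zero, i.e. $\mu(W_g) = 0$.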

This is very cool! This means, given some optimization algorithm $g$ and an objective function $f$, if we want to show that we don’t end up at strict saddle points, it suffices to verify these two properties:

- For every $x \in \mathcal{X}$, $\det(Dg(x)) \neq 0$.
- Every strict saddle point is an unstable fixed point of $g$.

Let’s try this out for gradient descent. To get the two desired properties, we’ll make two additional assumptions:

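- The gradient of $f$ is $L$-Lipschitz. That is, $\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$ for a constant $L > 0$. This means $\|\nabla^2 f(x)\|_2 \leq L$ for all $x$.
- We bound our step size: $\alpha < 1/L$.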

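Let’s verify the first property. We can calculate the Jacobian of $g$ as:

$$Dg(x) = I - \alpha \nabla^2 f(x).$$

So if the $i$th eigenvalue of $\nabla^2 f(x)$ is $\lambda_i$, the corresponding eigenvalue for $Dg(x)$ is $1 - \alpha \lambda_i$. So we have that

$$\det(Dg(x)) = \prod_{i=1}^{d} (1 - \alpha \lambda_i).$$

Because we assumed $\alpha < 1/L$ and $|\lambda_i| \leq L$, every factor is strictly positive, so this determinant is never 0!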

Now for the second property. Consider a strict saddle point $x^*$. We know that there is at least one eigenvalue $\lambda < 0$ of $\nabla^2 f(x^*)$. So the corresponding eigenvalue for $Dg(x^*)$ is $1 - \alpha \lambda > 1$. So $x^*$ is an unstable fixed point.
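Both properties are easy to check numerically on a toy problem. The sketch below assumes the illustrative objective $f(x, y) = x^2/2 + y^4/4 - y^2/2$, whose Hessian is diagonal, and restricts attention to the box $|y| \leq 1$, where the Hessian eigenvalues are bounded by $2$ in magnitude. The function names, the grid, and the bound are my own choices.

```python
def hessian_eigenvalues(p):
    """Eigenvalues of the diagonal Hessian of f(x, y) = x**2/2 + y**4/4 - y**2/2."""
    _, y = p
    return (1.0, 3 * y**2 - 1)


def dg_eigenvalues(p, alpha):
    """Eigenvalues of Dg(p) = I - alpha * Hessian(p)."""
    return tuple(1 - alpha * lam for lam in hessian_eigenvalues(p))


def det_dg(p, alpha):
    """Determinant of Dg(p), the product of its eigenvalues."""
    e1, e2 = dg_eigenvalues(p, alpha)
    return e1 * e2


LIPSCHITZ = 2.0  # bound on |Hessian eigenvalues| over the box |y| <= 1
grid = [(i / 10, j / 10) for i in range(-10, 11) for j in range(-10, 11)]

# Property 1: with alpha < 1 / LIPSCHITZ the determinant stays positive on the box.
print(min(det_dg(p, 0.25) for p in grid))

# Property 2: at the strict saddle (0, 0) one eigenvalue of Dg exceeds one.
print(max(dg_eigenvalues((0.0, 0.0), 0.25)))

# With alpha = 1 / LIPSCHITZ exactly, Dg becomes singular at the minimum (0, 1).
print(det_dg((0.0, 1.0), 1 / LIPSCHITZ))
```

The last line shows why the step size bound is strict: at the boundary value the map stops being a local diffeomorphism, so Theorem 1 no longer applies.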

Now by applying Corollary 1, we’ve proved

Corollary 2: For gradient descent, under the Lipschitz gradient assumption with $\alpha < 1/L$, the stable set of strict saddle points has measure zero.
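As a sanity check rather than a proof, here is a small experiment on a toy problem. I’m assuming the illustrative objective $f(x, y) = x^2/2 + y^4/4 - y^2/2$, uniform random starts in the box $[-1, 1]^2$ (which the iterates never leave for this step size), and a fixed seed and iteration budget. The only starts that reach the saddle at $(0, 0)$ are those with $y_0 = 0$, a line that a random draw essentially never hits.

```python
import random

ALPHA = 0.25  # step size; below 1/L with L = 2 on the box |y| <= 1
STEPS = 500   # illustrative iteration budget


def gd_step(p):
    """One gradient descent step on f(x, y) = x**2/2 + y**4/4 - y**2/2."""
    x, y = p
    return (x - ALPHA * x, y - ALPHA * (y**3 - y))


def run(p0, steps=STEPS):
    """Iterate gd_step from p0 and return the final point."""
    p = p0
    for _ in range(steps):
        p = gd_step(p)
    return p


def near(p, q, tol=1e-6):
    """True if p and q agree in both coordinates up to tol."""
    return abs(p[0] - q[0]) < tol and abs(p[1] - q[1]) < tol


random.seed(0)
starts = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(1000)]
ends = [run(p0) for p0 in starts]

saddle_hits = sum(near(p, (0.0, 0.0)) for p in ends)
minimum_hits = sum(near(p, (0.0, 1.0)) or near(p, (0.0, -1.0)) for p in ends)
on_axis_end = run((0.7, 0.0))  # a hand-picked start on the stable set

print(saddle_hits, minimum_hits, on_axis_end)
```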

There’s no reason why this proof technique should only work for gradient descent. The bulk of the rest of the paper goes into proving that these two properties hold for each of the given optimization algorithms I mentioned at the start of the post. For each algorithm, they find a set of assumptions that allow us to apply Corollary 1 to get a nice result. I found the paper very enlightening, and I would like to see more work like this with perspectives from dynamical systems theory. To finish, the authors point out some directions for future work.

An adaptive choice of the step size may still avoid saddle points. So it’s worthwhile to check if line search techniques also have the desired properties listed. For example, in backtracking line search, we adaptively select the step size based on some minimal change in objective value. In one standard (Armijo) form: given some $c, \tau \in (0, 1)$, start with some $\alpha_0 > 0$. Call $\alpha_j = \tau^j \alpha_0$. Then while $f(x - \alpha_j \nabla f(x)) > f(x) - c\,\alpha_j \|\nabla f(x)\|^2$, increment $j$. It’d be worthwhile to find some sufficient conditions under which this also avoids saddle points.
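For reference, here is a minimal sketch of gradient descent with an Armijo-style backtracking line search. The objective, the starting point, and the parameters `alpha0`, `c`, and `tau` are illustrative assumptions of mine. The code only shows the mechanics of the step size selection; it says nothing about whether the saddle-avoidance guarantee carries over.

```python
def f(p):
    """Illustrative objective f(x, y) = x**2/2 + y**4/4 - y**2/2."""
    x, y = p
    return x**2 / 2 + y**4 / 4 - y**2 / 2


def grad_f(p):
    """Gradient of f."""
    x, y = p
    return (x, y**3 - y)


def backtracking_step(p, alpha0=1.0, c=0.5, tau=0.5):
    """Take one gradient step, shrinking alpha until sufficient decrease holds."""
    grad = grad_f(p)
    sq_norm = sum(gi * gi for gi in grad)

    def candidate(a):
        return tuple(pi - a * gi for pi, gi in zip(p, grad))

    alpha = alpha0
    # Sufficient decrease test: f must drop by at least c * alpha * ||grad||^2.
    while f(candidate(alpha)) > f(p) - c * alpha * sq_norm:
        alpha *= tau
    return candidate(alpha), alpha


p = (0.9, 0.3)
for _ in range(200):
    p, alpha = backtracking_step(p)
print(p, alpha)
```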

Our results only say something about strict saddle points. For functions which satisfy the *strict saddle point* property, i.e. every saddle point is a strict saddle point, this is even nicer, since then gradient descent avoids every saddle point for almost all initializations. But how stable is this property? In other words, if we perturb the function a little bit, does the property still hold? This is equivalent to asking if the strict saddle point property is stable under homotopy. We could also check the stronger condition that the eigenvalues of $\nabla^2 f$ are non-zero at critical points, which implies the strict saddle condition. For random functions in the stochastic setting, this amounts to asking questions about the density of , and dives into Morse theory.

I like the idea of understanding how stochasticity “speeds up” convergence by potentially skipping through saddle points that would normally take much longer to escape. In this paper on proving nonconvergence to saddle points for stochastic gradient descent, noise is essential to proving the desired result. More generally, it would be interesting to compare and contrast the dynamics of a deterministic optimization system with its stochastic counterpart using these techniques.
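To get a feel for the role of noise, here is a toy comparison of my own; it does not reproduce the analysis of the work mentioned above. I’m assuming the illustrative objective $f(x, y) = x^2/2 + y^4/4 - y^2/2$, Gaussian noise added to every step, and a start that sits exactly on the stable set of the saddle. The noise level, seed, and step count are arbitrary.

```python
import random

ALPHA = 0.25  # step size
SIGMA = 1e-3  # illustrative noise level
STEPS = 500   # illustrative iteration budget


def gd_step(p):
    """One gradient descent step on f(x, y) = x**2/2 + y**4/4 - y**2/2."""
    x, y = p
    return (x - ALPHA * x, y - ALPHA * (y**3 - y))


def noisy_step(p, rng):
    """A gradient descent step followed by a small Gaussian perturbation."""
    x, y = gd_step(p)
    return (x + rng.gauss(0.0, SIGMA), y + rng.gauss(0.0, SIGMA))


rng = random.Random(0)
p_det = p_noisy = (0.7, 0.0)  # on the line y = 0, the saddle's stable set
for _ in range(STEPS):
    p_det = gd_step(p_det)
    p_noisy = noisy_step(p_noisy, rng)

print(p_det)    # deterministic run stays on y = 0 and ends at the saddle
print(p_noisy)  # noisy run is kicked off the line and settles near a minimum
```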

The last direction addresses the first bullet point at the beginning of the article. If we don’t converge to saddles, how good are the local minima to which we converge? The authors point to some existing game-theoretic work on the size of the region of attraction of good local minima. It would be interesting to explore more general conditions under which these regions dominate those of bad local minima.

Written on November 17th, 2017 by Noah Golmant