The title of this post is the name of a recently uploaded arxiv preprint by Lee et al. This work is by the same researchers who wrote Gradient Descent Converges to Minimizers (in COLT 2016), as well as some folks who followed up on that work by answering some open questions about step size upper-bounds and non-isolated critical points. The original paper focused on the convergence of vanilla gradient descent, while this new submission significantly expands that work to prove convergence properties of the proximal point algorithm, (block) coordinate descent, manifold gradient descent, and mirror descent.

There’s been a lot of recent work on understanding why first-order descent methods work so well in so many non-convex settings. A lot of this has been inspired by stochastic gradient descent’s huge success in deep learning, where the loss surfaces are highly non-convex. Some big-shot questions include:

- How good are local minima on these loss surfaces?
- How prevalent are saddle points on these loss surfaces?
- Are second-order methods very useful in this setting? Are first-order methods good enough?
- What kind of critical points does gradient descent (or some other first-order method) find?

This paper is mainly looking at the last point. In particular, it’s asking about the likelihood that first-order methods converge to saddle points. As my first blog post, I chose this paper because I found its proof techniques elegant and intuitive. The main contribution is clearly marked out in the very first line of the abstract:

We establish that first-order methods avoid saddle points for almost all initializations.

Very cool! Let’s dive into what this means. For simplicity’s sake, I’m going to outline their work on gradient descent, and then towards the end, I’ll talk about what conditions they prove for the other first-order methods.

We’ll be working in the domain \(\mathcal{X} = \mathbb{R}^d\), with a twice-continuously differentiable objective function \(\hspace{.5ex} f: \mathcal{X} \rightarrow \mathbb{R}\). We denote the gradient as \(\nabla f(x)\) and the Hessian as \(\nabla^2 f(x)\). Our goal is to find a minimizer of \(f\). Recall that a point \(x^* \in \mathcal{X}\) is called a critical point of \(f\) if \(\nabla f(x^*) = 0\).

Let’s recall some basic properties about the eigenvalues of the Hessian at a critical point. When they’re all positive, we can imagine the surface as locally “bowl-shaped”, and the point is a local minimum. Likewise, when they’re all negative, the surface curves downward and we are at a local maximum. If there are a mix of positive and negative eigenvalues (all non-zero), we say we’re at a *nondegenerate* saddle point, e.g. an inflection point in the 1D case. If some eigenvalues are zero, we say we’re at a *degenerate* saddle point, which indicates a more complex loss surface, such as a monkey saddle.

Next, we have an important definition, which determines the types of critical points we’ll be working to avoid in our minimization algorithm:

Definition 1: A point \(x^*\) is a strict saddle point of \(f\) if \(x^*\) is a critical point and \(\lambda_{\min}(\nabla^2 f(x^*)) < 0\) (where \(\lambda_\min(A)\) denotes the smallest eigenvalue of \(A\)). Let \(\mathcal{X}^*\) denote the set of strict saddle points.

This definition actually includes local maxima. We’re dealing with a minimization algorithm, though, and our work will show that such points are unstable as well. These strict saddle points contrast with degenerate saddle points where the Hessian is positive semidefinite.

It would suck to converge to a saddle point, mainly because it’s not a minimum. In fact, they often have really bad objective values (studied here for example). Intuitively, it feels unlikely that gradient descent would converge to something in \(\mathcal{X}^*\) given a random initialization. Imagine a hill with a narrow plateau in the middle. We’re about to roll a ball down the hill, and we’re choosing a starting point. Sure, I can find some spots where the ball would end up at the plateau, but more often than not the ball would make its way down. This is a bit oversimplified, but the authors provide a more formal example of this phenomenon in section 3.

Generically, we’re interested in an optimization algorithm \(g: \mathcal{X} \rightarrow \mathcal{X}\). The iterates of this algorithm are given by the sequence \(x_{k} = g(x_{k-1}) = g^k(x_0)\), where \(g^k\) is the \(k\)-fold composition of \(g\). In the case of gradient descent with step size \(\alpha\), this is given by \(g(x) = x - \alpha \nabla f(x)\). We call \(x^* \in \mathcal{X}\) a *fixed point* of \(g\) if \(g(x^*) = x^*\). For the gradient descent case, \(x^*\) is a fixed point of \(g\) if and only if \(x^*\) is a critical point of \(f\).

Using this, we can define the set of initial points that are “bad”:

\[\begin{align} W_g &= \{ x_0 \colon \underset{k}{\lim} g^k(x_0) \in \mathcal{X}^* \} \end{align}\]

Definition 2(Global Stable Set): The global stable set of strict saddle points is the set \(W_g\) of initial conditions where the iterates converge to a strict saddle point.

When do we avoid strict saddle points? When the probability of ending up at one is zero. In this case, we say that the set \(W_g\) has measure zero, i.e. \(\mu(W_g) = 0\). We’ll use this \(\mu(A) = 0\) notation for the rest of the article.

Next, we need another definition. Call \(Dg(x)\) the Jacobian of \(g\) at \(x\), and let \(\lambda_i(Dg(x))\) be its \(i\)th eigenvalue. We’re going to look at the set of fixed points that are “unstable”. Intuitively, a fixed point \(x^*\) is unstable if the function is very “curvy” in a neighborhood about \(x^*\). If it’s curvy, then there are a lot of directions I could take that would get me out of the neighborhood very quickly. More formally:

\[\begin{align} \mathcal{A}_g^* = \{ x \colon g(x) = x, \underset{i}{\max} |\lambda_i(Dg(x))| > 1 \} \end{align}\]

Definition 3(Unstable fixed points): Let \(\mathcal{A}_g^*\) be the set of fixed points where the differential has at least a single eigenvalue with magnitude greater than one. These are the unstable fixed points. That is,

Let me emphasize that here, we’re talking about the eigenvalues of the Jacobian of \(g\), not the Hessian of \(f\). \(Dg(x)\) gives us information about how the iterates behave near a fixed point. In fact, it’s precisely the linearization of the discrete-time dynamical system whose dynamics are governed by \(g\). That is, if I model the trajectory of a state \(x\) over time by iteratively applying \(g\) to \(x\), the Jacobian provides a linear approximation of this trajectory in a neighborhood of \(x\). So whenever I have eigenvalues with large magnitude, I have a subspace that provides a “line of approach” along which I’ll shoot past my fixed point, hence the term “unstable.”

To talk about the measure of these sets, the authors use something called the Stable Manifold Theorem from dynamical systems theory. I’m going to state most of its formal definition, and then work out what it means for us practically.

First, let me define a *diffeomorphism*. A function \(f\) is a diffeomorphism if it has an inverse and both \(f\) and its inverse are smooth. We can also have something slightly weaker called a *local diffeomorphism*. \(\hspace{.75ex} f: X \rightarrow Y\) is a local diffeomorphism if for each \(x \in X\), there is a neighborhood \(U\) around \(x\) such that \(f\) restricted to \(U\) is a diffeomorphism. That is, I can find a small area around \(x\) on which I can define a smooth inverse, but I might not be able to find one inverse that works for the whole space. One example of a local diffeomorphism which isn’t a global one is \(f(t) = (\cos t, \sin t)\) which “wraps around itself” infinitely many times.

Basically, given some iterate \(x_k\), we might not be able to smoothly trace back to our starting position \(x_0\) using \(g\). But if \(g\) is a local diffeomorphism, we can retrace our steps in a small region near \(x_k\). Now, we don’t know *a priori* that this is the case for an arbitrary \(g\), but let’s see what happens when this is true:

Theorem 1(Stable Manifold Theorem): Let \(x^*\) be a fixed point of a local diffeomorphism \(g\). Let \(E_s\) be the span of the eigenvectors of \(Dg(x^*)\) corresponding to eigenvalues of magnitude less than or equal to one. Then there is an embedded disk \(W_{loc}^{cs}\) tangent to \(E_s\) at \(x^*\) called thelocal stable center manifold. Moreover, there is a neighborhood \(B\) of \(x^*\), such that \(g(W_{loc}^{cs}) \cap B \subset W_{loc}^{cs}\), and \(\cap_{k=0}^\infty g^{-k}(B) \subset W_{loc}^{cs}\).

Let’s unpack this. We’re saying that there is a stable manifold \(W_{loc}^{cs}\) around \(x^*\). What do we mean by stable? Well, if we are in \(W_{loc}^{cs}\), we apply \(g\), and end up inside our neighborhood \(B\), then we are still in \(W_{loc}^{cs}\). And if we try to “retrace our steps” from something in \(B\), no matter how far back we go, we must have come from \(W_{loc}^{cs}\). So we’re sort of “trapped” around this fixed point. For our purposes, we know that if \(x^*\) is a strict saddle point and \(W_{loc}^{cs}\) is “small” enough, we shouldn’t get stuck around it. And it seems like if \(x^*\) is an unstable fixed point, then this stable region should definitely be “small” enough. This is true since when \(x^*\) is an unstable fixed point, \(E_s\) has dimension less than \(d\). This implies that \(W_{loc}^{cs}\) is an embedded disk with dimension less than \(d\), and hence has measure zero. Just imagine a flat disk in three-dimensional space. The volume of this infinitely thin disk is zero.

This seems very promising. But how do we verify that \(g\) is a local diffeomorphism? Well, by the Inverse Function Theorem, we just need to check that the Jacobian of \(g\) at any point \(x\) is nonsingular, or equivalently that the determinant \(\text{det}(Dg(x)) \neq 0\) for all \(x \in \mathcal{X}\).

This work leads to the following theorem:

\[\mu(\{ x_0 \colon \lim x_k \in \mathcal{A}_g^* \}) = 0\]

Theorem 2: Suppose \(\text{det}(Dg(x)) \neq 0\) for all \(x \in \mathcal{X}\). Then the set of initial points that end up at unstable fixed points has measure zero.

*An abbreviated proof*: For every point \(x^*\) in \(\mathcal{A}_g^*\), I can find a stable neighborhood (call it \(B_{x^*}\)) using Theorem 1. Now look at some \(x_0 \in \{ x_0 \colon \lim x_k \in \mathcal{A}_g^* \} := W\). Then I can find some \(T \geq 0\), \(x^* \in \mathcal{A}_g^*\) such that \(g^t(x_0) \in B_{x^*}\) for all \(t \geq T\). So \(g^t(x_0) \in \cap_{k=0}^\infty g^{-k}(B_{x^*})\) for all \(t \geq T\). By Theorem 1, this set is a subset of the stable center manifold and has measure zero. Using this, we can deduce that \(W\) is contained in a union of sets of measure zero, and so \(\mu(W) = 0\) as desired. \(\square\)

This immediately leads to the corollary:

Corollary 1: Suppose every strict saddle point is an unstable fixed point, i.e. \(\mathcal{X}^* \subset \mathcal{A}_g^*\). Then the global stable set of saddle points \(W_g\) has measure zero.

This is very cool! This means, given some optimization algorithm \(g\) and an objective function \(f\), if we want to show that we don’t end up at saddle points, it suffices to verify these two properties:

- For every \(x \in \mathcal{X}, \text{det}(Dg(x)) \neq 0\).
- Every strict saddle point \(x^*\) is an unstable fixed point of \(g\).

Let’s try this out for gradient descent. To get the two desired properties, we’ll make two additional assumptions:

- The gradient of \(f\) is \(L\)-Lipschitz. That is, \(\|\nabla^2 f(x)\| \leq L\) for a constant \(L > 0\). This means \(\lambda_{\max}(\nabla^2 f(x)) \leq L\) for all \(x \in \mathcal{X}\).
- We bound our step size: \(0 < \alpha < 1/L\)

Let’s verify the first property. We can calculate the Jacobian of \(g\) as:

\[\begin{align} Dg(x) = I - \alpha \nabla^2 f(x) \end{align}\]So if the \(i\)th eigenvalue of \(\nabla^2 f(x)\) is \(\lambda_i\), the corresponding eigenvalue for \(Dg(x)\) is \(1 - \alpha \lambda_i\). So we have that

\[\begin{align} \text{det}(Dg(x)) = \Pi_i (1 - \alpha \lambda_i) \end{align}\]Because we assumed \(\lambda _i \leq L\) and \(\alpha < 1/L\), this determinant is never 0!

Now for the second property. Consider a strict saddle point \(x^*\). We know that there is at least one eigenvalue \(\lambda < 0\) of \(\nabla^2 f(x^*)\). So the corresponding eigenvalue for \(Dg(x^*)\) is \(1 - \alpha \lambda > 1\). So \(x^*\) is an unstable fixed point.

Now by applying Corollary 1, we’ve proved

Corollary 2: For gradient descent, under the Lipschitz gradient assumption with \(0 < \alpha < 1/L\), the stable set of strict saddle points has measure zero.

There’s no reason why this proof technique should only work for gradient descent. The bulk of the rest of the paper goes into proving that these two properties hold for each of the given optimization algorithms I mentioned at the start of the post. For each algorithm, they find a set of assumptions that allow us to apply Corollary 1 to get a nice result. I found the paper very enlightening, and I would like to see more work like this with perspectives from dynamical systems theory. To finish, the authors point out some directions for future work.

An adaptive choice of the step size may still avoid saddle points. So it’s worthwhile to check if line search techniques also have the desired properties listed. For example, in backtracking line search, we adaptively select the step size based on some minimal change in objective value. Given some \(0 < \beta < 1\), start with some \(\alpha_0 > 0, t=1\). Call \(\alpha_t = \beta^t \alpha_0\). Then while \(f(x - \alpha_t \nabla f(x)) > f(x) - \frac{\alpha_t}{2}\|\nabla f(x)\|^2\), increment \(t\). It’d be worthwhile to find some sufficient conditions under which this also avoids saddle points.

We really made significant statements about strict saddle points. For functions which satisfy the *strict saddle point* property, i.e. every saddle point is a strict saddle point, this is even nicer. But how stable is this property? In other words, if we perturb the function a little bit, does the property still hold? This is equivalent to asking if the strict saddle point property is stable under homotopy. We could also check the stronger condition that the eigenvalues of \(\nabla^2 f(x)\) are non-zero at critical points, which implies the strict saddle condition. For random functions in the stochastic setting, this amounts to asking questions about the density of \(\nabla^2 f(x)\), and dives into Morse theory.

I like the idea of understanding how stochasticity “speeds up” convergence by potentially skipping through saddle points that would normally take much longer to escape. In this paper on proving nonconvergence to saddle points for stochastic gradient descent, noise is essential to proving the desired result. More generally, it would be interesting to compare and contrast the dynamics of a deterministic optimization system with its stochastic counterpart using these techniques.

Here they’re addressing the first bullet point at the beginning of the article. If we don’t converge to saddles, how good are the local minima to which we converge? The authors point to some existing game-theoretic work on the size of the region of attraction of good local minima. It would be interesting to explore more general conditions under which these regions dominate those of bad local minima.

Written on November 17th, 2017 by Noah GolmantFeel free to share!