Noah Golmant

First-order Methods Almost Always Avoid Saddle Points

The title of this post is the name of a recently uploaded arxiv preprint by Lee et al. This work is by the same researchers who wrote Gradient Descent Converges to Minimizers (in COLT 2016), as well as some folks who followed up on that work by answering some open questions about step size upper-bounds and non-isolated critical points. The original paper focused on the convergence of vanilla gradient descent, while this new submission significantly expands that work to prove convergence properties of the proximal point algorithm, (block) coordinate descent, manifold gradient descent, and mirror descent.

There’s been a lot of recent work on understanding why first-order descent methods work so well in so many non-convex settings. A lot of this has been inspired by stochastic gradient descent’s huge success in deep learning, where the loss surfaces are highly non-convex. Some big-shot questions include:

This paper is mainly looking at the last point. In particular, it’s asking about the likelihood that first-order methods converge to saddle points. As my first blog post, I chose this paper because I found its proof techniques elegant and intuitive. The main contribution is clearly marked out in the very first line of the abstract:

We establish that first-order methods avoid saddle points for almost all initializations.

Very cool! Let’s dive into what this means. For simplicity’s sake, I’m going to outline their work on gradient descent, and then towards the end, I’ll talk about what conditions they prove for the other first-order methods.

Setup

We’ll be working in the domain X=Rd\mathcal{X} = \mathbb{R}^d , with a twice-continuously differentiable objective function f:XR\hspace{.5ex} f: \mathcal{X} \rightarrow \mathbb{R}. We denote the gradient as f(x)\nabla f(x) and the Hessian as 2f(x)\nabla^2 f(x). Our goal is to find a minimizer of ff. Recall that a point xXx^* \in \mathcal{X} is called a critical point of ff if f(x)=0\nabla f(x^*) = 0.

Let’s recall some basic properties about the eigenvalues of the Hessian at a critical point. When they’re all positive, we can imagine the surface as locally “bowl-shaped”, and the point is a local minimum. Likewise, when they’re all negative, the surface curves downward and we are at a local maximum. If there are a mix of positive and negative eigenvalues (all non-zero), we say we’re at a nondegenerate saddle point, e.g. an inflection point in the 1D case. If some eigenvalues are zero, we say we’re at a degenerate saddle point, which indicates a more complex loss surface, such as a monkey saddle.

Next, we have an important definition, which determines the types of critical points we’ll be working to avoid in our minimization algorithm:

Definition 1: A point xx^* is a strict saddle point of ff if xx^* is a critical point and λmin(2f(x))<0\lambda_{\min}(\nabla^2 f(x^*)) < 0 (where \lambda_\min(A) denotes the smallest eigenvalue of AA). Let X\mathcal{X}^* denote the set of strict saddle points.

This definition actually includes local maxima. We’re dealing with a minimization algorithm, though, and our work will show that such points are unstable as well. These strict saddle points contrast with degenerate saddle points where the Hessian is positive semidefinite.

It would suck to converge to a saddle point, mainly because it’s not a minimum. In fact, they often have really bad objective values (studied here for example). Intuitively, it feels unlikely that gradient descent would converge to something in X\mathcal{X}^* given a random initialization. Imagine a hill with a narrow plateau in the middle. We’re about to roll a ball down the hill, and we’re choosing a starting point. Sure, I can find some spots where the ball would end up at the plateau, but more often than not the ball would make its way down. This is a bit oversimplified, but the authors provide a more formal example of this phenomenon in section 3.

Generically, we’re interested in an optimization algorithm g:XXg: \mathcal{X} \rightarrow \mathcal{X}. The iterates of this algorithm are given by the sequence xk=g(xk1)=gk(x0)x_{k} = g(x_{k-1}) = g^k(x_0) , where gkg^k is the kk-fold composition of gg. In the case of gradient descent with step size α\alpha, this is given by g(x)=xαf(x)g(x) = x - \alpha \nabla f(x). We call xXx^* \in \mathcal{X} a fixed point of gg if g(x)=xg(x^*) = x^*. For the gradient descent case, xx^* is a fixed point of gg if and only if xx^* is a critical point of ff.

Using this, we can define the set of initial points that are “bad”:

Definition 2 (Global Stable Set): The global stable set of strict saddle points is the set WgW_g of initial conditions where the iterates converge to a strict saddle point.

Wg={x0 ⁣:limkgk(x0)X}\begin{align} W_g &= \{ x_0 \colon \underset{k}{\lim} g^k(x_0) \in \mathcal{X}^* \} \end{align}

When do we avoid strict saddle points? When the probability of ending up at one is zero. In this case, we say that the set WgW_g has measure zero, i.e. μ(Wg)=0\mu(W_g) = 0. We’ll use this μ(A)=0\mu(A) = 0 notation for the rest of the article.

Next, we need another definition. Call Dg(x)Dg(x) the Jacobian of gg at xx, and let λi(Dg(x))\lambda_i(Dg(x)) be its iith eigenvalue. We’re going to look at the set of fixed points that are “unstable”. Intuitively, a fixed point xx^* is unstable if the function is very “curvy” in a neighborhood about xx^*. If it’s curvy, then there are a lot of directions I could take that would get me out of the neighborhood very quickly. More formally:

Definition 3 (Unstable fixed points): Let Ag\mathcal{A}_g^* be the set of fixed points where the differential has at least a single eigenvalue with magnitude greater than one. These are the unstable fixed points. That is,

Ag={x ⁣:g(x)=x,maxiλi(Dg(x))>1}\begin{align} \mathcal{A}_g^* = \{ x \colon g(x) = x, \underset{i}{\max} |\lambda_i(Dg(x))| > 1 \} \end{align}

Let me emphasize that here, we’re talking about the eigenvalues of the Jacobian of gg, not the Hessian of ff. Dg(x)Dg(x) gives us information about how the iterates behave near a fixed point. In fact, it’s precisely the linearization of the discrete-time dynamical system whose dynamics are governed by gg. That is, if I model the trajectory of a state xx over time by iteratively applying gg to xx, the Jacobian provides a linear approximation of this trajectory in a neighborhood of xx. So whenever I have eigenvalues with large magnitude, I have a subspace that provides a “line of approach” along which I’ll shoot past my fixed point, hence the term “unstable.”

The Stable Manifold Theorem: what is it?

To talk about the measure of these sets, the authors use something called the Stable Manifold Theorem from dynamical systems theory. I’m going to state most of its formal definition, and then work out what it means for us practically.

First, let me define a diffeomorphism. A function ff is a diffeomorphism if it has an inverse and both ff and its inverse are smooth. We can also have something slightly weaker called a local diffeomorphism. f:XY\hspace{.75ex} f: X \rightarrow Y is a local diffeomorphism if for each xXx \in X, there is a neighborhood UU around xx such that ff restricted to UU is a diffeomorphism. That is, I can find a small area around xx on which I can define a smooth inverse, but I might not be able to find one inverse that works for the whole space. One example of a local diffeomorphism which isn’t a global one is f(t)=(cost,sint)f(t) = (\cos t, \sin t) which “wraps around itself” infinitely many times.

Basically, given some iterate xkx_k, we might not be able to smoothly trace back to our starting position x0x_0 using gg. But if gg is a local diffeomorphism, we can retrace our steps in a small region near xkx_k. Now, we don’t know a priori that this is the case for an arbitrary gg, but let’s see what happens when this is true:

Theorem 1 (Stable Manifold Theorem): Let xx^* be a fixed point of a local diffeomorphism gg. Let EsE_s be the span of the eigenvectors of Dg(x)Dg(x^*) corresponding to eigenvalues of magnitude less than or equal to one. Then there is an embedded disk WloccsW_{loc}^{cs} tangent to EsE_s at xx^* called the local stable center manifold. Moreover, there is a neighborhood BB of xx^*, such that g(Wloccs)BWloccsg(W_{loc}^{cs}) \cap B \subset W_{loc}^{cs}, and k=0gk(B)Wloccs\cap_{k=0}^\infty g^{-k}(B) \subset W_{loc}^{cs}.

Let’s unpack this. We’re saying that there is a stable manifold WloccsW_{loc}^{cs} around xx^*. What do we mean by stable? Well, if we are in WloccsW_{loc}^{cs}, we apply gg, and end up inside our neighborhood BB, then we are still in WloccsW_{loc}^{cs}. And if we try to “retrace our steps” from something in BB, no matter how far back we go, we must have come from WloccsW_{loc}^{cs}. So we’re sort of “trapped” around this fixed point. For our purposes, we know that if xx^* is a strict saddle point and WloccsW_{loc}^{cs} is “small” enough, we shouldn’t get stuck around it. And it seems like if xx^* is an unstable fixed point, then this stable region should definitely be “small” enough. This is true since when xx^* is an unstable fixed point, EsE_s has dimension less than dd. This implies that WloccsW_{loc}^{cs} is an embedded disk with dimension less than dd, and hence has measure zero. Just imagine a flat disk in three-dimensional space. The volume of this infinitely thin disk is zero.

The Stable Manifold Theorem: how to use it

This seems very promising. But how do we verify that gg is a local diffeomorphism? Well, by the Inverse Function Theorem, we just need to check that the Jacobian of gg at any point xx is nonsingular, or equivalently that the determinant det(Dg(x))0\text{det}(Dg(x)) \neq 0 for all xXx \in \mathcal{X}.

This work leads to the following theorem:

Theorem 2: Suppose det(Dg(x))0\text{det}(Dg(x)) \neq 0 for all xXx \in \mathcal{X}. Then the set of initial points that end up at unstable fixed points has measure zero.

μ({x0 ⁣:limxkAg})=0\mu(\{ x_0 \colon \lim x_k \in \mathcal{A}_g^* \}) = 0

An abbreviated proof: For every point xx^* in Ag\mathcal{A}_g^*, I can find a stable neighborhood (call it BxB_{x^*}) using Theorem 1. Now look at some x0{x0 ⁣:limxkAg}:=Wx_0 \in \{ x_0 \colon \lim x_k \in \mathcal{A}_g^* \} := W. Then I can find some T0T \geq 0, xAgx^* \in \mathcal{A}_g^* such that gt(x0)Bxg^t(x_0) \in B_{x^*} for all tTt \geq T. So gt(x0)k=0gk(Bx)g^t(x_0) \in \cap_{k=0}^\infty g^{-k}(B_{x^*}) for all tTt \geq T. By Theorem 1, this set is a subset of the stable center manifold and has measure zero. Using this, we can deduce that WW is contained in a union of sets of measure zero, and so μ(W)=0\mu(W) = 0 as desired. \square

This immediately leads to the corollary:

Corollary 1: Suppose every strict saddle point is an unstable fixed point, i.e. XAg\mathcal{X}^* \subset \mathcal{A}_g^*. Then the global stable set of saddle points WgW_g has measure zero.

This is very cool! This means, given some optimization algorithm gg and an objective function ff, if we want to show that we don’t end up at saddle points, it suffices to verify these two properties:

An application: Gradient Descent

Let’s try this out for gradient descent. To get the two desired properties, we’ll make two additional assumptions:

Let’s verify the first property. We can calculate the Jacobian of gg as:

Dg(x)=Iα2f(x)\begin{align} Dg(x) = I - \alpha \nabla^2 f(x) \end{align}

So if the iith eigenvalue of 2f(x)\nabla^2 f(x) is λi\lambda_i, the corresponding eigenvalue for Dg(x)Dg(x) is 1αλi1 - \alpha \lambda_i. So we have that

det(Dg(x))=Πi(1αλi)\begin{align} \text{det}(Dg(x)) = \Pi_i (1 - \alpha \lambda_i) \end{align}

Because we assumed λiL\lambda _i \leq L and α<1/L\alpha < 1/L, this determinant is never 0!

Now for the second property. Consider a strict saddle point xx^*. We know that there is at least one eigenvalue λ<0\lambda < 0 of 2f(x)\nabla^2 f(x^*). So the corresponding eigenvalue for Dg(x)Dg(x^*) is 1αλ>11 - \alpha \lambda > 1. So xx^* is an unstable fixed point.

Now by applying Corollary 1, we’ve proved

Corollary 2: For gradient descent, under the Lipschitz gradient assumption with 0<α<1/L0 < \alpha < 1/L, the stable set of strict saddle points has measure zero.

Conclusion, open questions

There’s no reason why this proof technique should only work for gradient descent. The bulk of the rest of the paper goes into proving that these two properties hold for each of the given optimization algorithms I mentioned at the start of the post. For each algorithm, they find a set of assumptions that allow us to apply Corollary 1 to get a nice result. I found the paper very enlightening, and I would like to see more work like this with perspectives from dynamical systems theory. To finish, the authors point out some directions for future work.

An adaptive choice of the step size may still avoid saddle points. So it’s worthwhile to check if line search techniques also have the desired properties listed. For example, in backtracking line search, we adaptively select the step size based on some minimal change in objective value. Given some 0<β<10 < \beta < 1, start with some α0>0,t=1\alpha_0 > 0, t=1. Call αt=βtα0\alpha_t = \beta^t \alpha_0. Then while f(xαtf(x))>f(x)αt2f(x)2f(x - \alpha_t \nabla f(x)) > f(x) - \frac{\alpha_t}{2}\|\nabla f(x)\|^2, increment tt. It’d be worthwhile to find some sufficient conditions under which this also avoids saddle points.

Strict saddles

We really made significant statements about strict saddle points. For functions which satisfy the strict saddle point property, i.e. every saddle point is a strict saddle point, this is even nicer. But how stable is this property? In other words, if we perturb the function a little bit, does the property still hold? This is equivalent to asking if the strict saddle point property is stable under homotopy. We could also check the stronger condition that the eigenvalues of 2f(x)\nabla^2 f(x) are non-zero at critical points, which implies the strict saddle condition. For random functions in the stochastic setting, this amounts to asking questions about the density of 2f(x)\nabla^2 f(x), and dives into Morse theory.

Speed of convergence

I like the idea of understanding how stochasticity “speeds up” convergence by potentially skipping through saddle points that would normally take much longer to escape. In this paper on proving nonconvergence to saddle points for stochastic gradient descent, noise is essential to proving the desired result. More generally, it would be interesting to compare and contrast the dynamics of a deterministic optimization system with its stochastic counterpart using these techniques.

Beyond saddle points

Here they’re addressing the first bullet point at the beginning of the article. If we don’t converge to saddles, how good are the local minima to which we converge? The authors point to some existing game-theoretic work on the size of the region of attraction of good local minima. It would be interesting to explore more general conditions under which these regions dominate those of bad local minima.