Solved – Gradient descent on non-convex functions

Tags: gradient, gradient descent, non-convex, stochastic gradient descent

What situations do we know of where gradient descent can be shown to converge (either to a critical point or to a local/global minimum) for non-convex functions?


For SGD on non-convex functions, one kind of proof has been reviewed here:
http://www.cs.cornell.edu/courses/cs6787/2017fa/Lecture7.pdf

Best Answer

In this answer I will explore two interesting and relevant papers that were brought up in the comments. Before doing so, I will attempt to formalize the problem and to shed some light on some of the assumptions and definitions. I begin with a 2016 paper by Lee et al.

We seek to minimize a non-convex function $f: \mathbb{R}^d \to \mathbb{R}$ that is bounded below. We require it to be twice differentiable. We use a gradient descent algorithm of the form:

$\pmb{x}_{t+1} = \pmb{x}_t - \alpha\nabla f(\pmb{x}_t)$.

Additionally, we have the following requirement:

$\| \nabla f(\pmb{x}_1)-\nabla f(\pmb{x}_2) \| \leq \ell \| \pmb{x}_1 - \pmb{x}_2 \|, \quad \text{for all } \pmb{x}_1, \pmb{x}_2$.

That is, we require our function to have an $\ell$-Lipschitz gradient. In plain English, this means the gradient cannot change too rapidly anywhere in the domain. This assumption ensures that we can choose a step size such that the iterates never diverge.
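To make the setup concrete, here is a minimal Python sketch of the iteration above (my own illustration, not from the paper). The step size $\alpha = 1/\ell$ is one standard choice justified by the Lipschitz assumption; the toy objective and the value of `ell` in the usage line are hypothetical choices for demonstration.

```python
import numpy as np

# Minimal sketch of x_{t+1} = x_t - alpha * grad f(x_t) with a fixed step
# size alpha = 1/ell, where ell is a Lipschitz constant for the gradient.
def gradient_descent(grad_f, x0, ell, n_iters=1000, tol=1e-8):
    x = np.asarray(x0, dtype=float)
    alpha = 1.0 / ell
    for _ in range(n_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # stop near an (approximate) critical point
            break
        x = x - alpha * g
    return x

# Toy non-convex objective f(x, y) = (x**2 - 1)**2 + y**2; ell = 12 is a
# hypothetical bound on the gradient's Lipschitz constant near the iterates.
grad = lambda p: np.array([4 * p[0] * (p[0] ** 2 - 1), 2 * p[1]])
print(gradient_descent(grad, x0=[0.5, 0.5], ell=12.0))  # approaches (1, 0)
```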

Recall that a point $\pmb{x}$ is said to be a strict saddle if $\nabla f(\pmb{x}) = 0$, $\lambda_{\min}\left(\nabla^2 f(\pmb{x})\right) < 0$, and $\lambda_{\max}\left(\nabla^2 f(\pmb{x})\right) > 0$. If all of the eigenvalues of the Hessian have the same sign, the critical point is a local minimum (if they are all positive) or a local maximum (if they are all negative). If any eigenvalue is 0, the critical point is said to be degenerate, and it is not a strict saddle.
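As a quick illustration of these definitions (mine, not the paper's), here is a small helper that classifies a critical point by the eigenvalues of its Hessian:

```python
import numpy as np

# Classify a critical point (where the gradient vanishes) by the sign pattern
# of the Hessian's eigenvalues, following the definitions above.
def classify_critical_point(hessian, tol=1e-10):
    eigs = np.linalg.eigvalsh(hessian)      # Hessian is symmetric
    if np.any(np.abs(eigs) < tol):
        return "degenerate"                 # some eigenvalue is (numerically) zero
    if np.all(eigs > 0):
        return "local minimum"
    if np.all(eigs < 0):
        return "local maximum"
    return "strict saddle"                  # mixed signs: lambda_min < 0 < lambda_max

print(classify_critical_point(np.diag([2.0, -2.0])))  # -> "strict saddle"
```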

The paper shows that, with the above assumptions plus the assumption that every saddle point of the function is a strict saddle, gradient descent with random initialization is guaranteed (almost surely) to converge to a local minimum.

The proof is quite technical, but the intuition is this: define the set $W^s(\pmb{x}^s) = \{\pmb{x} : \lim_k g^k(\pmb{x}) = \pmb{x}^s \}$, where $\pmb{x}^s$ is a saddle point and $g^k$ denotes $k$ applications of the gradient map $g : \mathbb{R}^d \to \mathbb{R}^d$, $g(\pmb{x}) = \pmb{x} - \alpha\nabla f(\pmb{x})$. I don't like this notation at all. What they are getting at is that $W^s$ is the set of starting values whose gradient descent iterates converge to $\pmb{x}^s$. Put more plainly, it is the set of random initializations that will ultimately converge to a saddle.

Their argument relies on the Stable Manifold Theorem. With the above assumptions and a bunch of esoteric math, they conclude that the set $W^s$ must have measure zero; that is, there is zero probability of randomly initializing at a point that will converge to a saddle. Since gradient descent with a suitably small step size on functions satisfying these assumptions eventually reaches a critical point, and we now know that it (almost surely) never lands on a saddle, it must converge to a minimizer.
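Here is a toy numerical illustration of the measure-zero claim (my example, not one from the paper). For $f(x, y) = (x^2 - 1)^2 + y^2$ the origin is a strict saddle, and the set of initializations that gradient descent sends to it is essentially the line $x = 0$, which has measure zero; random starting points therefore end up near one of the minimizers $(\pm 1, 0)$ with probability one.

```python
import numpy as np

# Gradient of the toy double-well f(x, y) = (x**2 - 1)**2 + y**2, which has a
# strict saddle at the origin and minimizers at (+/-1, 0).
def grad(p):
    x, y = p
    return np.array([4 * x * (x ** 2 - 1), 2 * y])

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.uniform(-2.0, 2.0, size=2)  # random initialization
    for _ in range(5000):
        p = p - 0.01 * grad(p)          # small fixed step size
    print(np.round(p, 4))               # lands near (1, 0) or (-1, 0), never the saddle
```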

I will discuss the second, more recent paper, by Reddi et al., in less detail. There are several differences. First, they are no longer working in a deterministic framework, instead opting for the more practically relevant stochastic approximation framework on a finite sum (think Stochastic Gradient Descent). The primary differences there are that the step size requires some additional care and the gradient becomes a random variable. Additionally, they relax the assumption that all saddles are strict, and instead look for a second-order stationary point: a point such that

$\|\nabla f(\pmb{x})\| \leq \epsilon \quad \text{and} \quad \lambda_{\min}\left(\nabla^2 f(\pmb{x})\right) \geq -\sqrt{\rho\epsilon},$

where $\rho$ is the Lipschitz constant of the Hessian. (That is, in addition to requiring that the gradient not vary too rapidly, we now have a similar requirement on the Hessian.) Essentially, the authors are looking for a point that looks like a minimum in both the first and second derivative.
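For concreteness, this stopping criterion can be written as a small check like the following (an illustrative sketch; the callables `grad` and `hess` and the constants `eps` and `rho` are names I am assuming, not taken from the paper):

```python
import numpy as np

# Check the (eps, rho)-second-order stationarity condition above: a small
# gradient norm and a smallest Hessian eigenvalue that is not too negative.
def is_second_order_stationary(grad, hess, x, eps, rho):
    small_gradient = np.linalg.norm(grad(x)) <= eps
    almost_psd = np.linalg.eigvalsh(hess(x)).min() >= -np.sqrt(rho * eps)
    return small_gradient and almost_psd
```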

The method by which they accomplish this is to use a variant (pick your favorite) of stochastic gradient descent most of the time. But whenever they encounter a point where $\lambda_{\min}\left(\nabla^2 f(\pmb{x})\right)\leq 0$, they use a suitably chosen*** second-order method to escape the saddle. They show that by incorporating this second-order information as needed, the method converges to a second-order stationary point.
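To give a feel for the overall structure, here is a schematic sketch of the "first-order most of the time, second-order on demand" idea. This is emphatically not the authors' algorithm: the escape step below simply moves along the most negative curvature direction, standing in for the cubic regularized Newton subroutine they actually use, and the first-order part is plain deterministic gradient descent rather than a stochastic variant.

```python
import numpy as np

# Schematic only: cheap first-order steps while the gradient is large, and a
# negative-curvature step (a stand-in for a proper second-order subroutine)
# whenever we reach a point whose smallest Hessian eigenvalue is non-positive.
def escape_saddle_descent(grad, hess, x0, alpha=0.01, n_iters=10000, tol=1e-6):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad(x)
        if np.linalg.norm(g) > tol:
            x = x - alpha * g                  # first-order step
        else:
            lam, vecs = np.linalg.eigh(hess(x))
            if lam[0] >= 0:                    # approximate second-order stationary point
                return x
            x = x + alpha * vecs[:, 0]         # escape along negative curvature
    return x
```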

Technically this is a second order gradient method, which may or may not fall under the umbrella of algorithms you were interested in.

This is a very active area of research and I've left out many important contributions (e.g., Ge et al.). I'm also new to the topic, so this question has given me an opportunity to look into it. I'm happy to continue the discussion if there is interest.

*** "Suitably chosen" means one that is shown to converge to a second-order stationary point. They use the cubic regularized Newton method of Nesterov and Polyak.
