Escape from local minima in the gradient descent method

global optimization, gradient descent, numerical methods, numerical optimization, optimization

I'm using gradient descent $$x_i=x_{i-1}-\gamma\nabla f(x_{i-1})\tag1$$ to minimize a function $f$. Now, I've observed that I got stuck in a local minimum when $x_0$ was chosen in an unlucky way.

Is there any mechanism to detect that we've got stuck in a local minimum and escape it in some way?

I've come up with a simple idea which actually works, but which I guess is quite inefficient. What I'm doing is replacing $(1)$ by $$x_i=x_{i-1}-\gamma_i\frac{\nabla f(x_{i-1})}{\left\|\nabla f(x_{i-1})\right\|}\tag2,$$ where $\gamma_i=\gamma_{i-1}/2$ and $\gamma_0=1$. Then, after every iteration $i$, I randomly choose $\tilde\gamma\in[0,1)$, compute $$\tilde x_i=x_i-\tilde\gamma\frac{\nabla f(x_i)}{\left\|\nabla f(x_i)\right\|}\tag3$$ and, if $f(\tilde x_i)<f(x_i)$, set $\gamma_i=\tilde\gamma$ and $x_i=\tilde x_i$.
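In code, the scheme looks roughly like this (a minimal NumPy sketch of $(2)$ and $(3)$; the double-well test function at the end is just for illustration):

```python
import numpy as np

def descend(f, grad_f, x0, n_iter=100, rng=np.random.default_rng(0)):
    """Normalized gradient descent with a halving step size (eq. (2))
    and a random trial step after each iteration (eq. (3))."""
    x, gamma = np.asarray(x0, dtype=float), 1.0
    for _ in range(n_iter):
        g = grad_f(x)
        norm = np.linalg.norm(g)
        if norm == 0.0:               # exact stationary point: stop
            break
        gamma /= 2.0                  # gamma_i = gamma_{i-1} / 2
        x = x - gamma * g / norm      # eq. (2)

        # random trial step, accepted only if it decreases f (eq. (3))
        g = grad_f(x)
        norm = np.linalg.norm(g)
        if norm == 0.0:
            break
        gamma_trial = rng.uniform(0.0, 1.0)
        x_trial = x - gamma_trial * g / norm
        if f(x_trial) < f(x):
            gamma, x = gamma_trial, x_trial
    return x

# toy example: a 2-D double well
f = lambda x: (x[0]**2 - 1.0)**2 + x[1]**2
grad_f = lambda x: np.array([4.0 * x[0] * (x[0]**2 - 1.0), 2.0 * x[1]])
print(descend(f, grad_f, x0=[0.1, 0.5]))
```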

I'm sure there is a smarter way.

Best Answer

Plenty of solutions to this very problem are being developed, driven by the use of gradient descent methods to optimize neural networks.

Depending on the dimension of your state space $x$, there are smarter or more straightforward ways to do it.

If the dimension of $x$ is low, you can do a brute-force search on a grid around the point where you get stuck. How large and dense the grid has to be depends on the regularity of the function $f$.
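For instance, a minimal sketch of such a grid search might look like this (the grid half-width, the spacing and the test function are illustrative choices, not prescriptions):

```python
import itertools
import numpy as np

def grid_escape(f, x_stuck, half_width=1.0, points_per_axis=11):
    """Brute-force search on a regular grid centred at the stuck point.
    Feasible only in low dimensions: the number of evaluations grows as
    points_per_axis ** dim."""
    x_stuck = np.asarray(x_stuck, dtype=float)
    offsets = np.linspace(-half_width, half_width, points_per_axis)
    best_x, best_val = x_stuck, f(x_stuck)
    for shift in itertools.product(offsets, repeat=x_stuck.size):
        candidate = x_stuck + np.asarray(shift)
        val = f(candidate)
        if val < best_val:
            best_x, best_val = candidate, val
    return best_x

# example: a tilted double well where x = (1, 0) is only a local minimum;
# the grid search jumps into the basin of the better minimum near x = -1
f = lambda x: (x[0]**2 - 1.0)**2 + 0.3 * x[0] + x[1]**2
print(grid_escape(f, x_stuck=[1.0, 0.0], half_width=3.0))
```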

If the dimension is greater than $1$, you probably also want to escape saddle points, so you should analyse the eigenvalues of the Hessian of $f$ near the point where you are stuck: a clearly negative eigenvalue indicates a saddle, and the corresponding eigenvector gives a direction along which $f$ decreases.
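A rough sketch of that check, using a finite-difference Hessian (the step size `h`, the escape step and the tolerance are illustrative only):

```python
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Central-difference approximation of the Hessian of f at x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4.0 * h**2)
    return H

def escape_saddle(f, x, step=1e-1, tol=1e-6):
    """If the Hessian at x has a clearly negative eigenvalue, step along the
    corresponding eigenvector (whichever sign decreases f); otherwise stay."""
    H = numerical_hessian(f, x)
    eigvals, eigvecs = np.linalg.eigh(H)   # H is symmetric
    if eigvals[0] < -tol:                  # negative curvature present
        v = eigvecs[:, 0]
        return min([x + step * v, x - step * v], key=f)
    return x

# example: the saddle point (0, 0) of f(x, y) = x^2 - y^2
f = lambda x: x[0]**2 - x[1]**2
print(escape_saddle(f, np.zeros(2)))       # steps away from the saddle along y
```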

That being said, if the dimensionality of your problem is low, there are theoretically better-founded optimization techniques of order higher than one, which come with convergence guarantees and work better than plain gradient descent.
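For instance, assuming the gradient and Hessian are cheap to evaluate, one can hand the problem to a trust-region Newton solver; the sketch below uses SciPy's built-in Rosenbrock test function purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

# Rosenbrock function in 5 dimensions, minimized with a trust-region
# Newton method that uses the exact gradient and Hessian.
x0 = np.full(5, 1.3)
result = minimize(rosen, x0, method="trust-ncg",
                  jac=rosen_der, hess=rosen_hess,
                  options={"gtol": 1e-8})
print(result.x)   # close to the minimizer (1, ..., 1)
```

As with any local method, this converges to a nearby stationary point; the guarantees concern the rate and quality of that local convergence, not global optimality.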

If the dimensionality of your problem is high, you cannot afford to compute the Hessian because of its quadratic cost in the dimension. In this case you want to restrict yourself to first-order optimization techniques; the most popular choices here are SGD, Adam and RAdam. If you are concerned about saddle points, you can add noise at each iteration, and you can do it in a principled way, as presented in the works of Michael I. Jordan and co-authors on perturbed gradient descent.
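A crude sketch of the noise-injection idea is below; the perturbation radius and the gradient-norm threshold are arbitrary illustrative values, not the carefully tuned quantities from those works.

```python
import numpy as np

def perturbed_gradient_descent(f, grad_f, x0, gamma=0.05, n_iter=500,
                               grad_tol=1e-3, radius=0.1,
                               rng=np.random.default_rng(0)):
    """Plain gradient descent, except that whenever the gradient is nearly
    zero a small random perturbation is injected, so that saddle points are
    not absorbing.  Returns the best iterate seen (by f-value)."""
    x = np.asarray(x0, dtype=float)
    best_x, best_val = x.copy(), f(x)
    for _ in range(n_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < grad_tol:
            # nearly stationary: jump to a random nearby point
            x = x + radius * rng.standard_normal(x.size)
        else:
            x = x - gamma * g
        if f(x) < best_val:
            best_x, best_val = x.copy(), f(x)
    return best_x

# example: f(x, y) = x^4 - 2x^2 + y^2 has a saddle at the origin and two
# minima at (+-1, 0); starting exactly at the saddle still works.
f = lambda x: x[0]**4 - 2.0 * x[0]**2 + x[1]**2
grad_f = lambda x: np.array([4.0 * x[0]**3 - 4.0 * x[0], 2.0 * x[1]])
print(perturbed_gradient_descent(f, grad_f, np.zeros(2)))
```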

As a rule of thumb: if you are looking for a good-enough solution and the dimensionality of $x$ is absurdly high, go and read up on some Deep Learning literature. If you want theoretical guarantees for your optimization algorithm and the dimensionality of $x$ is still pretty high, read the Physics literature. If you have no margin for error and the dimension of $x$ is low, read the Mathematics literature.

The topic is vast and I have barely scratched the surface here. I hope you can now track down the solution best suited to your problem yourself, because, stated as it is, the question has no single good answer.
