Escape from local minima in the gradient descent method

global optimization, gradient descent, numerical methods, numerical optimization, optimization

I'm using gradient descent $$x_i=x_{i-1}-\gamma\nabla f(x_{i-1})\tag1$$ to minimize a function $f$. Now, I've observed that I got stuck in a local minimum when $x_0$ was chosen in an unlucky way.

Is there any mechanism to detect that we've got stuck in a local minimum and escape it in some way?

I've come up with a simple idea which actually works, but which I guess is quite inefficient. What I'm doing is replacing $(1)$ by $$x_i=x_{i-1}-\gamma_i\frac{\nabla f(x_{i-1})}{\left\|\nabla f(x_{i-1})\right\|}\tag2,$$ where $\gamma_i=\gamma_{i-1}/2$ and $\gamma_0=1$. Then, after every iteration $i$, I randomly choose $\tilde\gamma\in[0,1)$, compute $$\tilde x_i=x_i-\tilde\gamma\frac{\nabla f(x_i)}{\left\|\nabla f(x_i)\right\|}\tag3$$ and, if $f(\tilde x_i)<f(x_i)$, set $\gamma_i=\tilde\gamma$ and $x_i=\tilde x_i$.
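In code, the scheme looks roughly like this (a minimal NumPy sketch of $(2)$ and $(3)$; the double-well test function at the end is just for illustration):

```python
import numpy as np

def descend(f, grad_f, x0, n_iter=100, rng=np.random.default_rng(0)):
    """Normalized gradient descent with a halving step size (eq. (2))
    and a random trial step after each iteration (eq. (3))."""
    x, gamma = np.asarray(x0, dtype=float), 1.0
    for _ in range(n_iter):
        g = grad_f(x)
        norm = np.linalg.norm(g)
        if norm == 0.0:               # exact stationary point: stop
            break
        gamma /= 2.0                  # gamma_i = gamma_{i-1} / 2
        x = x - gamma * g / norm      # eq. (2)

        # random trial step, accepted only if it decreases f (eq. (3))
        g = grad_f(x)
        norm = np.linalg.norm(g)
        if norm == 0.0:
            break
        gamma_trial = rng.uniform(0.0, 1.0)
        x_trial = x - gamma_trial * g / norm
        if f(x_trial) < f(x):
            gamma, x = gamma_trial, x_trial
    return x

# toy example: a 2-D double well
f = lambda x: (x[0]**2 - 1.0)**2 + x[1]**2
grad_f = lambda x: np.array([4.0 * x[0] * (x[0]**2 - 1.0), 2.0 * x[1]])
print(descend(f, grad_f, x0=[0.1, 0.5]))
```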

I'm sure there is a smarter way.

Best Answer

Plenty of solutions to this very problem are being developed, driven by the use of gradient descent methods to optimize neural networks.

Depending on the dimension of your state space $x$, there are smarter or more straightforward ways to do it.

If the dimension of $x$ is low, you can do a brute-force search on a grid around the point where you get stuck. How large and dense the grid has to be depends on the regularity of the function $f$.
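For instance, a minimal sketch of such a grid search might look like this (the grid half-width, the spacing and the test function are illustrative choices, not prescriptions):

```python
import itertools
import numpy as np

def grid_escape(f, x_stuck, half_width=1.0, points_per_axis=11):
    """Brute-force search on a regular grid centred at the stuck point.
    Feasible only in low dimensions: the number of evaluations grows as
    points_per_axis ** dim."""
    x_stuck = np.asarray(x_stuck, dtype=float)
    offsets = np.linspace(-half_width, half_width, points_per_axis)
    best_x, best_val = x_stuck, f(x_stuck)
    for shift in itertools.product(offsets, repeat=x_stuck.size):
        candidate = x_stuck + np.asarray(shift)
        val = f(candidate)
        if val < best_val:
            best_x, best_val = candidate, val
    return best_x

# example: a tilted double well where x = (1, 0) is only a local minimum;
# the grid search jumps into the basin of the better minimum near x = -1
f = lambda x: (x[0]**2 - 1.0)**2 + 0.3 * x[0] + x[1]**2
print(grid_escape(f, x_stuck=[1.0, 0.0], half_width=3.0))
```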

If the dimension is greater than $1$, you probably also want to escape saddle points, so you should analyse the eigenvalues of the Hessian of $f$ near the point where you are stuck: a clearly negative eigenvalue indicates a saddle, and the corresponding eigenvector gives a direction along which $f$ decreases.
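A rough sketch of that check, using a finite-difference Hessian (the step size `h`, the escape step and the tolerance are illustrative only):

```python
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Central-difference approximation of the Hessian of f at x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4.0 * h**2)
    return H

def escape_saddle(f, x, step=1e-1, tol=1e-6):
    """If the Hessian at x has a clearly negative eigenvalue, step along the
    corresponding eigenvector (whichever sign decreases f); otherwise stay."""
    H = numerical_hessian(f, x)
    eigvals, eigvecs = np.linalg.eigh(H)   # H is symmetric
    if eigvals[0] < -tol:                  # negative curvature present
        v = eigvecs[:, 0]
        return min([x + step * v, x - step * v], key=f)
    return x

# example: the saddle point (0, 0) of f(x, y) = x^2 - y^2
f = lambda x: x[0]**2 - x[1]**2
print(escape_saddle(f, np.zeros(2)))       # steps away from the saddle along y
```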

That being said, if the dimensionality of your problem is low, there are theoretically better-founded optimization techniques of order higher than one, which come with convergence guarantees and work better than plain gradient descent.
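For instance, assuming the gradient and Hessian are cheap to evaluate, one can hand the problem to a trust-region Newton solver; the sketch below uses SciPy's built-in Rosenbrock test function purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

# Rosenbrock function in 5 dimensions, minimized with a trust-region
# Newton method that uses the exact gradient and Hessian.
x0 = np.full(5, 1.3)
result = minimize(rosen, x0, method="trust-ncg",
                  jac=rosen_der, hess=rosen_hess,
                  options={"gtol": 1e-8})
print(result.x)   # close to the minimizer (1, ..., 1)
```

As with any local method, this converges to a nearby stationary point; the guarantees concern the rate and quality of that local convergence, not global optimality.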

If the dimensionality of your problem is high, you cannot afford to compute the Hessian because of its quadratic cost in the dimension. In this case you want to restrict yourself to first-order optimization techniques; the most popular choices here are SGD, Adam and RAdam. If you are concerned about saddle points, you can add noise at each iteration, and you can do it in a principled way, as presented in the works of Michael I. Jordan and co-authors on perturbed gradient descent.
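A crude sketch of the noise-injection idea is below; the perturbation radius and the gradient-norm threshold are arbitrary illustrative values, not the carefully tuned quantities from those works.

```python
import numpy as np

def perturbed_gradient_descent(f, grad_f, x0, gamma=0.05, n_iter=500,
                               grad_tol=1e-3, radius=0.1,
                               rng=np.random.default_rng(0)):
    """Plain gradient descent, except that whenever the gradient is nearly
    zero a small random perturbation is injected, so that saddle points are
    not absorbing.  Returns the best iterate seen (by f-value)."""
    x = np.asarray(x0, dtype=float)
    best_x, best_val = x.copy(), f(x)
    for _ in range(n_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < grad_tol:
            # nearly stationary: jump to a random nearby point
            x = x + radius * rng.standard_normal(x.size)
        else:
            x = x - gamma * g
        if f(x) < best_val:
            best_x, best_val = x.copy(), f(x)
    return best_x

# example: f(x, y) = x^4 - 2x^2 + y^2 has a saddle at the origin and two
# minima at (+-1, 0); starting exactly at the saddle still works.
f = lambda x: x[0]**4 - 2.0 * x[0]**2 + x[1]**2
grad_f = lambda x: np.array([4.0 * x[0]**3 - 4.0 * x[0], 2.0 * x[1]])
print(perturbed_gradient_descent(f, grad_f, np.zeros(2)))
```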

As a rule of thumb: if you are looking for a good-enough solution and the dimensionality of $x$ is absurdly high, go and read up on some Deep Learning literature. If you want theoretical guarantees for your optimization algorithm and the dimensionality of $x$ is still pretty high, read the Physics literature. If you have no margin for error and the dimension of $x$ is low, read the Mathematics literature.

The topic is vast and I have barely scratched the surface here. I hope you can now track down the solution best suited to your problem yourself, because, stated as it is, the question has no single good answer.
