[Math] Intuition for gradient descent with Nesterov momentum

Tags: gradient-descent, intuition, nonlinear-optimization, numerical-optimization, reference-request

A clear article on Nesterov’s Accelerated Gradient Descent (S. Bubeck, April 2013) says:

The intuition behind the algorithm is quite difficult to grasp,
and unfortunately the analysis will not be very enlightening either.

This seems odd, for such a powerful method
("you do not really understand something unless you can explain it to your grandmother").

Can anyone point to plots / visualizations of alternating M-steps and G-steps (momentum and gradient) on real or tutorial problems?
In particular, are "Nesterov ripples" (oscillations) common, and if so, what should one do about them?

Best Answer

I just posted an answer on stats.stackexchange that is meant to help build intuition about the difference between Classical Momentum (CM) and Nesterov's Accelerated Gradient (NAG). It also contains a visualization, though not of a real or tutorial problem; it is one I made up myself.

Following is a copy of my answer. (This meta.stackexchange answer seems to approve of this copy-paste pattern.)



tl;dr
Just skip to the image at the end.
NAG_ball's reasoning is another important part, but I am not sure it would be easy to understand without all of the rest.



CM and NAG are both methods for choosing the next vector $\theta$ in parameter space, in order to find a minimum of a function $f(\theta)$.

In other news, lately these two wild sentient balls appeared:
[Images: CM_ball and NAG_ball]

It turns out (according to the observed behavior of the balls, and according to the paper On the importance of initialization and momentum in deep learning, which describes both CM and NAG in section 2) that each ball behaves exactly like one of these methods, so we will call them "CM_ball" and "NAG_ball":
(NAG_ball is smiling, because he recently watched the end of Lecture 6c - The momentum method, by Geoffrey Hinton with Nitish Srivastava and Kevin Swersky, and thus believes more than ever that his behavior leads to finding a minimum faster.)

Here is how the balls behave:

  • Instead of rolling like normal balls, they jump between points in parameter space.
    Let $\theta_t$ be a ball's $t$-th location in parameter space, and let $v_t$ be the ball's $t$-th jump. Then jumping between points in parameter space can be described by $\theta_t=\theta_{t-1}+v_t$.
  • Not only do they jump instead of roll, but also their jumps are special: Each jump $v_t$ is actually a Double Jump, which is the composition of two jumps:
    • Momentum Jump - a jump that uses the momentum from $v_{t-1}$, the last Double Jump.
      A small fraction of the momentum of $v_{t-1}$ is lost due to friction with the air.
      Let $\mu$ be the fraction of the momentum that is left (the balls are quite aerodynamic, so usually $0.9 \le \mu <1$). Then the Momentum Jump is equal to $\mu v_{t-1}$.
      (In both CM and NAG, $\mu$ is a hyperparameter called "momentum coefficient".)
    • Slope Jump - a jump that reminds me of the result of putting a normal ball on a surface - the ball starts rolling in the direction of the steepest slope downward, while the steeper the slope, the larger the acceleration.
      Similarly, the Slope Jump is in the direction of the steepest slope downward (the direction opposite to the gradient), and the larger the gradient, the further the jump.
      The Slope Jump also depends on $\epsilon$, the level of eagerness of the ball (naturally, $\epsilon>0$): The more eager the ball, the further the Slope Jump would take it.
      (In both CM and NAG, $\epsilon$ is a hyperparameter called "learning rate".)
      Let $g$ be the gradient in the starting location of the Slope Jump. Then the Slope Jump is equal to $-\epsilon g$.
  • So for both balls the Double Jump is equal to: $$v_t=\mu v_{t-1} -\epsilon g$$ The only difference between the balls is the order of the two jumps in the Double Jump (a short code sketch comparing the two updates follows this list).
  • CM_ball didn't think it mattered, so he decided to always start with the Slope Jump.
    Thus, CM_ball's Double Jump is: $$v_{t}=\mu v_{t-1}-\epsilon\nabla f\left(\theta_{t-1}\right)$$
  • In contrast, NAG_ball thought about it for some time, and then decided to always start with the Momentum Jump.
    Therefore, NAG_ball's Double Jump is: $$v_{t}=\mu v_{t-1}-\epsilon\nabla f\left(\theta_{t-1}+\mu v_{t-1}\right)$$

    NAG_ball's reasoning

    • Whatever jump comes first, my Momentum Jump would be the same.
      So I should consider the situation as if I have already made my Momentum Jump, and I am about to make my Slope Jump.
    • Now, my Slope Jump is conceptually going to start from here, but I can choose whether to calculate what my Slope Jump would be as if it started before the Momentum Jump, or as if it started here.
    • Thinking about it this way makes it quite clear that the latter is better: the gradient at a point $\theta$ generally points roughly in the direction from $\theta$ toward a minimum (and with roughly the right magnitude), while the gradient at some other point is less likely to point in the direction from $\theta$ toward a minimum (or to have the right magnitude).
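
For readers who prefer to see the two Double Jumps as code, here is a minimal sketch of my own (not part of the original balls story): it runs both update rules on a toy 1-D quadratic $f(\theta)=\theta^2$; the function, the starting point, and the hyperparameter values are arbitrary illustrative choices.

```python
def grad(theta):
    return 2.0 * theta          # gradient of the toy function f(theta) = theta^2

mu, eps = 0.9, 0.1              # momentum coefficient and learning rate (hyperparameters)
theta_cm = theta_nag = 5.0      # both balls start at the same (arbitrary) point
v_cm = v_nag = 0.0              # no momentum before the first jump

for t in range(1, 8):
    # CM_ball: the Slope Jump uses the gradient at the current position theta_{t-1}.
    v_cm = mu * v_cm - eps * grad(theta_cm)
    theta_cm += v_cm

    # NAG_ball: the Slope Jump uses the gradient at the looked-ahead point
    # theta_{t-1} + mu * v_{t-1}, i.e. after the Momentum Jump has been made.
    v_nag = mu * v_nag - eps * grad(theta_nag + mu * v_nag)
    theta_nag += v_nag

    print(f"t={t}: CM theta = {theta_cm:+.3f}   NAG theta = {theta_nag:+.3f}")
```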

Finally, yesterday I was fortunate enough to observe each of the balls jumping around in a 1-dimensional parameter space.
I think that looking at their changing positions in the parameter space wouldn't help much with gaining intuition, as this parameter space is a line.
So instead, for each ball I sketched a 2-dimensional graph in which the horizontal axis is $\theta$.
Then I drew $f(\theta)$ using a black brush, and also drew each ball in his first $7$ positions, along with numbers to show the chronological order of the positions.
Lastly, I drew green arrows to show the distance in parameter space (i.e. the horizontal distance in the graph) of each Momentum Jump and Slope Jump.
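
If you want to reproduce this kind of sketch yourself, here is a rough, hypothetical matplotlib version (mine, not the original drawing): it plots $f(\theta)$ in black and marks NAG_ball's first $7$ positions with their chronological numbers; the quadratic $f$ and all numeric values are made-up examples.

```python
import numpy as np
import matplotlib.pyplot as plt

def f(theta):
    return theta ** 2           # the function being minimized (an arbitrary choice)

def grad(theta):
    return 2.0 * theta

mu, eps = 0.9, 0.1              # momentum coefficient and learning rate
theta, v = 5.0, 0.0
positions = [theta]
for _ in range(6):              # 6 Double Jumps -> 7 positions in total
    v = mu * v - eps * grad(theta + mu * v)   # NAG_ball's Double Jump
    theta += v
    positions.append(theta)

xs = np.linspace(-6, 6, 200)
plt.plot(xs, f(xs), color="black")                         # f(theta), the "black brush"
plt.scatter(positions, [f(p) for p in positions], color="red", zorder=3)
for i, p in enumerate(positions, start=1):                 # chronological numbering
    plt.annotate(str(i), (p, f(p)), textcoords="offset points", xytext=(0, 8))
plt.xlabel(r"$\theta$")
plt.ylabel(r"$f(\theta)$")
plt.show()
```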

[Image: CM_ball vs NAG_ball example]


Appendix 1 - A demonstration of NAG_ball's reasoning

In this mesmerizing gif by Alec Radford, you can see NAG performing arguably better than CM ("Momentum" in the gif).
(The minimum is where the star is, and the curves are contour lines. For an explanation about contour lines and why they are perpendicular to the gradient, see videos 1 and 2 by the legendary 3Blue1Brown.)

[GIF: NAG performing better than CM (Momentum)]

An analysis of a specific moment demonstrates NAG_ball's reasoning:

[Image: CM vs NAG at a specific moment]

  • The (long) purple arrow is the momentum sub-step.
  • The transparent red arrow is the gradient sub-step if it starts before the momentum sub-step.
  • The black arrow is the gradient sub-step if it starts after the momentum sub-step.
  • CM would end up in the target of the dark red arrow.
  • NAG would end up in the target of the black arrow (a small numeric sketch of this moment follows below).
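
To put numbers on the arrows, here is a small made-up sketch of one such moment: starting from the same position and velocity on a hypothetical 2-D quadratic bowl, it prints where the momentum sub-step lands and where CM and NAG each end up; all values are arbitrary.

```python
import numpy as np

# Hypothetical 2-D quadratic bowl f(x) = 0.5 * x^T A x, whose gradient is A x.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x

mu, eps = 0.9, 0.05             # momentum coefficient and learning rate
theta = np.array([2.0, 1.0])    # the shared current position
v = np.array([-0.5, 0.3])       # velocity carried over from the previous Double Jump

lookahead = theta + mu * v                     # where the momentum sub-step (purple arrow) lands
theta_cm  = lookahead - eps * grad(theta)      # gradient taken BEFORE the momentum sub-step
theta_nag = lookahead - eps * grad(lookahead)  # gradient taken AFTER the momentum sub-step

print("momentum sub-step lands at:", lookahead)
print("CM  ends at:", theta_cm)
print("NAG ends at:", theta_nag)
```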

Appendix 2 - things/terms I made up (for intuition's sake)

  • CM_ball
  • NAG_ball
  • Double Jump
  • Momentum Jump
  • Momentum lost due to friction with the air
  • Slope Jump
  • Eagerness of a ball
  • Me observing the balls yesterday

Appendix 3 - terms I didn't make up
