Solved – What’s the difference between momentum-based gradient descent and Nesterov’s accelerated gradient descent

gradient descent, optimization

So momentum-based gradient descent works as follows:

$v=\beta m-\eta g$

where $m$ is the previous weight update, $g$ is the current gradient with respect to the parameters $p$, $\eta$ is the learning rate, and $\beta$ is a constant (the momentum coefficient).

$p_{new} = p + v = p + \beta m - \eta g$
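To make this concrete, here is a minimal NumPy sketch of one momentum loop in the question's notation; the toy quadratic loss, its gradient, and the hyperparameter values are assumptions chosen purely for illustration:

```python
import numpy as np

def grad(p):
    # Toy gradient of f(p) = 0.5 * p**2; stands in for the real gradient.
    return p

p = np.array([2.0])    # parameters
m = np.zeros_like(p)   # previous weight update
eta, beta = 0.1, 0.9   # learning rate and momentum constant

for _ in range(100):
    g = grad(p)
    v = beta * m - eta * g   # v = beta*m - eta*g
    p = p + v                # p_new = p + v
    m = v                    # this update becomes the "previous" one next step
```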

and Nesterov's accelerated gradient descent works as follows (with $v = \beta m - \eta g$ as above):

$p_{new} = p + \beta v - \eta g$

which is equivalent to:

$p_{new} = p + \beta (\beta m - \eta g) - \eta g$

or

$p_{new} = p + \beta^2 m - (1 + \beta) \eta g$

source: https://github.com/fchollet/keras/blob/master/keras/optimizers.py
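For comparison, here is the same toy loop with the Keras-style Nesterov update described above (again an assumed gradient and hyperparameters, for illustration only); only the parameter-update line differs from plain momentum:

```python
import numpy as np

def grad(p):
    return p  # toy gradient, for illustration only

p = np.array([2.0])
m = np.zeros_like(p)
eta, beta = 0.1, 0.9

for _ in range(100):
    g = grad(p)
    v = beta * m - eta * g        # velocity update is unchanged
    p = p + beta * v - eta * g    # p_new = p + beta*v - eta*g (Nesterov form)
    m = v
```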

So to me it seems Nesterov's accelerated gradient descent just gives more weight to the $\eta g$ term relative to the previous weight-change term $m$ (compared to plain old momentum). Is this interpretation correct?

Best Answer

Arech's answer about Nesterov momentum is correct, but the Keras code essentially does the same thing. So in this regard the Nesterov method does give more weight to the $lr \cdot g$ term and less weight to the $v$ term. (Note: below, $m$ denotes the momentum coefficient and $v$ the previous velocity, so these symbols play different roles than in the question.)

To illustrate why Keras' implementation is correct, I'll borrow Geoffrey Hinton's example.
[Figure: Hinton's illustration of the Nesterov update, showing the brown (gamble), red (correction), and green (combined) vectors.]

The Nesterov method takes the "gamble -> correction" approach:
$v' = m \cdot v - lr \cdot \nabla(w+m \cdot v)$
$w' = w + v'$
The brown vector is $m \cdot v$ (gamble/jump), the red vector is $-lr \cdot \nabla(w+m \cdot v)$ (correction), and the green vector is $m \cdot v-lr \cdot \nabla(w+m \cdot v)$ (where we should actually move to). $\nabla(\cdot)$ is the gradient function.
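In code, the gamble/correction form looks like the sketch below (a minimal NumPy illustration with an assumed toy gradient); the key point is that the gradient is evaluated at the look-ahead point $w + m \cdot v$, not at $w$:

```python
import numpy as np

def grad(x):
    return x  # toy gradient, for illustration only

w = np.array([2.0])     # actual parameter estimate
v = np.zeros_like(w)    # velocity
lr, m = 0.1, 0.9        # learning rate and momentum coefficient

for _ in range(100):
    g = grad(w + m * v)   # gradient at the look-ahead ("gamble") point
    v = m * v - lr * g    # correction: v' = m*v - lr*grad(w + m*v)
    w = w + v             # follow the green vector: w' = w + v'
```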

The code looks different because it moves by the brown vector instead of the green vector, as the Nesterov method only requires evaluating $\nabla(w+m \cdot v) =: g$ instead of $\nabla(w)$. Therefore, in each step we want to:

  1. move back to where we were $(1 \rightarrow 0)$
  2. follow the green vector to where we should be $(0 \rightarrow 2)$
  3. make another gamble $(2 \rightarrow 3)$

Keras' update, written compactly, is $p' = p + m \cdot (m \cdot v - lr \cdot g) - lr \cdot g$, and with a bit of algebra

$\begin{align} p' &= p - m \cdot v + m \cdot v + m \cdot (m \cdot v - lr \cdot g) - lr \cdot g\\ &= p - m \cdot v + m \cdot v - lr \cdot g + m \cdot (m \cdot v - lr \cdot g)\\ &= p - m \cdot v + (m \cdot v-lr \cdot g) + m \cdot (m \cdot v-lr \cdot g) \end{align}$

which is exactly $1 \rightarrow 0 \rightarrow 2 \rightarrow 3$. In practice the code simply takes the shorter path $1 \rightarrow 2 \rightarrow 3$.

The actual parameter estimate (the tip of the green vector) is $p - m \cdot v$, since the stored $p$ always includes the pending gamble $m \cdot v$; it should be close to $p$ once learning converges and $v$ shrinks toward zero.
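As a quick sanity check (a sketch under the same toy assumptions as before), one can run both forms side by side and verify that the Keras variable $p$ always equals the look-ahead point, i.e. that $p - m \cdot v$ coincides with the $w$ tracked by the gamble/correction form:

```python
import numpy as np

def grad(x):
    return x  # toy gradient, for illustration only

lr, mom = 0.1, 0.9

# Gamble/correction form: w is the actual estimate.
w = np.array([2.0]); v1 = np.zeros_like(w)
# Keras form: p stores the look-ahead point p = w + mom*v.
p = np.array([2.0]); v2 = np.zeros_like(p)

for _ in range(50):
    g1 = grad(w + mom * v1)          # gradient at the look-ahead point
    v1 = mom * v1 - lr * g1
    w = w + v1
    g2 = grad(p)                     # Keras evaluates the gradient at p directly
    v2 = mom * v2 - lr * g2
    p = p + mom * v2 - lr * g2
    assert np.allclose(w, p - mom * v2)   # both forms track the same estimate
```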
