[Math] Momentum in gradient descent

gradient descentneural networksnumerical optimization

I often read about the importance of momentum in machine learning, namely, in neural networks. And as the partial derivative of the cost function w.r.t. to the weights gives us gradient descent, actually the slope of value in loss function, I can not really understand what part momentum would play in this role. If

$$\text{momentum} = \text{mass} \times \text{velocity}$$

I really don't understand what could be mass or velocity with respect to gradient descent.

Is there any simple explanation? What is the relation?

Best Answer

There are two reasons momentum appears in training ANNs.

It lets the algorithm "power through" flat areas, local minima, and saddle points. Empirically speaking, high dimensional loss surfaces are not pretty and have lots of these; hence, empirically, momentum has been found to be useful.
It speeds up training (again, empirically), probably partly due (1), but also because it seems to set a more appropriate learning rate (faster in homogeneous areas; slower in rougher ones) and helps avoid oscillations.

Personally, I also speculate that it is helpful in regularizing the stochasticity inherent in SGD (i.e. you "keep around" some of the information from the last training batch).

As for the second part of your question (i.e. "what are the mass and velocity?"), normally SGD is written: $$ \theta_t = \theta_{t-1} - \eta\hat{\nabla}\mathcal{L}(\theta_{t-1}) $$ as an update rule. Note that I'm using $\hat{\nabla}$ because we don't have the gradient; we just have an estimator of it. We can think of this as integrating a physical system, where $\theta$ is position and $\hat{\nabla}$ is velocity. I suppose $\eta$ is essentially $\Delta t$.

Let's define a velocity update to be "weighted": $$ v_{t+1} = \alpha v_t + (1-\alpha)\hat{\nabla}\mathcal{L}(\theta_{t-1}) $$ $$ \theta_{t+1} = \theta_t - \beta v_{t+1} $$ so that the instantaneous velocity at time $t+1$ is controlled by the "mass" $\alpha$. When $\alpha$ is high (i.e. 1), we simply use the velocity from last time, i.e. we are entirely driven by momentum. When $\alpha$ is low (i.e. 0), there is no momentum and the "particle" simply follows where the current "force" is pushing. (Note that I would not take the physical analogy too far though...)

Check out this article on momentum.

Disregarding my own advice above, here's my take on the "physics".

Consider a particle with time-varying position $x_t$, velocity $ v_t $, and mass $m$, undergoing an external force $F(x_t,t)$. We have (by $F=ma$): $$ \frac{dx^2_t}{dt^2} = \frac{1}{m}F(x_t,t) $$ which we can rewrite as: $$ \frac{dv_t}{dt} = \frac{1}{m}F(x_t,t)\;\;\;\;\;\&\;\;\;\;\; \frac{dx_t}{dt} = v_t $$ Discretizing the two equations to first-order (well, symplectic Euler, I guess) gives: \begin{align} v_{t+1} &= v_t + \frac{\Delta t}{m}F(x_t,t) \\ x_{t+1} &= x_t + v_{t+1} \Delta t \end{align} Now, just let $x_t=\theta_t$, $\eta=\Delta t$, and set the external force to be $F(x_t,t) = F(\theta)=\gamma\hat{\nabla}\mathcal{L}(\theta)$. Notice that as $m\rightarrow 0$, the contribution of $F$ becomes higher (i.e. momentum doesn't matter), whereas if $m\rightarrow \infty$, $v_t$ becomes a constant for all $t$ (nothing can move a huge neutron star from whatever its current trajectory is, right?).

Also, let $p_t = mv_t$, $ \xi = \eta/m $, and $\kappa = \eta\gamma$. Then: \begin{align} p_{t+1} &= p_t + \kappa\hat{\nabla}\mathcal{L}(\theta_t) \\ \theta_{t+1} &= \theta_t + \xi p_{t+1} \end{align} This is pretty close to the momentum equations above, the main difference being the use of the weighted average. How can we get something more like that?

Well, let's try to add a friction force or drag, i.e: \begin{align} \frac{\partial x_t}{\partial t} &= v_t \\ \frac{\partial v_t}{\partial t} &= \frac{1}{m}\left( F(x_t) - \epsilon v_t \right) \end{align} Discretizing this one gives: $$ v_{t+1} = \left(1-\frac{\Delta t\epsilon}{m}\right)v_t + \frac{\Delta t}{m}F(x_t) \;\;\;\;\;\&\;\;\;\;\; x_{t+1} = x_t + \Delta t v_{t+1} $$ which we can translate into $$ p_{t+1} = \left(1-\frac{\eta\epsilon}{m}\right)p_t + \eta\gamma\hat{\nabla}\mathcal{L}(\theta_t) \;\;\;\;\;\&\;\;\;\;\; \theta_{t+1} = \theta_t + \frac{\eta}{m} p_{t+1} $$ which we can rewrite as $$ p_{t+1} = \alpha p_t + (1-\alpha)\hat{\nabla}\mathcal{L}(\theta_{t}) $$ $$ \theta_{t+1} = \theta_t - \beta p_{t+1} $$ where we have set: $$ \alpha = \frac{\eta\epsilon}{m},\;\;\;\;\; \beta = \frac{\eta}{m},\;\;\;\;\; \gamma = \frac{1}{\eta} - \frac{\epsilon}{m} $$ Notice that this last form is identical to the form of momentum at which we first looked!

Best Answer

Related Solutions

[Math] Intuition for gradient descent with Nesterov momentum

NAG_ball's reasoning

Appendix 1 - A demonstration of NAG_ball's reasoning

Appendix 2 - things/terms I made up (for intuition's sake)

Appendix 3 - terms I didn't make up

[Math] Why do we need gradient descent to minimize a cost function

Related Question