There are two reasons momentum appears in training ANNs.
1. It lets the algorithm "power through" flat areas, local minima, and saddle points. Empirically speaking, high-dimensional loss surfaces are not pretty and have lots of these; hence, empirically, momentum has been found to be useful.
2. It speeds up training (again, empirically), probably partly due to (1), but also because it effectively adapts the step size (faster in homogeneous areas, slower in rougher ones) and helps avoid oscillations.
Personally, I also speculate that it helps average out the noise inherent in SGD (i.e. you "keep around" some of the information from previous training batches).
As for the second part of your question (i.e. "what are the mass and velocity?"), normally SGD is written:
$$
\theta_t = \theta_{t-1} - \eta\hat{\nabla}\mathcal{L}(\theta_{t-1})
$$
as an update rule. Note that I'm using $\hat{\nabla}$ because we don't have the gradient; we just have an estimator of it. We can think of this as integrating a physical system, where $\theta$ is position and $-\hat{\nabla}\mathcal{L}$ plays the role of velocity; I suppose $\eta$ is then essentially $\Delta t$.
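In code, a single step of this update might look as follows (a minimal sketch of my own; `grad_est` stands in for the estimator $\hat{\nabla}\mathcal{L}$, and the toy loss is my choice for illustration):

```python
import numpy as np

def sgd_step(theta, grad_est, eta=0.1):
    """theta_t = theta_{t-1} - eta * grad_est(theta_{t-1})."""
    return theta - eta * grad_est(theta)

# Toy loss L(theta) = ||theta||^2 / 2, whose (exact) gradient is theta itself.
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = sgd_step(theta, lambda th: th)
# theta has contracted geometrically toward the minimizer at the origin
```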
Let's define a "weighted" velocity update: $$ v_{t+1} = \alpha v_t + (1-\alpha)\hat{\nabla}\mathcal{L}(\theta_{t}) $$
$$ \theta_{t+1} = \theta_t - \beta v_{t+1} $$
so that the instantaneous velocity at time $t+1$ is controlled by the "mass" $\alpha$. When $\alpha$ is high (close to 1), we mostly keep the velocity from last time, i.e. we are almost entirely driven by momentum. When $\alpha$ is low (close to 0), there is no momentum and the "particle" simply follows wherever the current "force" is pushing. (Note that I would not take the physical analogy too far though...)
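Here is a minimal sketch of that weighted update (my own code, not any library's API) on a toy quadratic loss, with illustrative choices $\alpha=0.9$ and $\beta=0.1$:

```python
import numpy as np

def momentum_step(theta, v, grad_est, alpha=0.9, beta=0.1):
    """v_{t+1} = alpha*v_t + (1-alpha)*grad ;  theta_{t+1} = theta_t - beta*v_{t+1}."""
    v = alpha * v + (1 - alpha) * grad_est(theta)
    theta = theta - beta * v
    return theta, v

# Quadratic bowl L(theta) = ||theta||^2 / 2, gradient = theta.
theta, v = np.array([3.0, -1.0]), np.zeros(2)
for _ in range(300):
    theta, v = momentum_step(theta, v, lambda th: th)
# theta spirals in toward the origin; the running velocity smooths the trajectory
```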
Check out this article on momentum.
Disregarding my own advice above, here's my take on the "physics".
Consider a particle with time-varying position $x_t$, velocity $ v_t $, and mass $m$, undergoing an external force $F(x_t,t)$.
We have (by $F=ma$):
$$ \frac{d^2x_t}{dt^2} = \frac{1}{m}F(x_t,t) $$
which we can rewrite as:
$$
\frac{dv_t}{dt} = \frac{1}{m}F(x_t,t)\;\;\;\;\;\&\;\;\;\;\;
\frac{dx_t}{dt} = v_t
$$
Discretizing the two equations to first-order (well, symplectic Euler, I guess) gives:
\begin{align}
v_{t+1} &= v_t + \frac{\Delta t}{m}F(x_t,t) \\
x_{t+1} &= x_t + v_{t+1} \Delta t
\end{align}
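As a sketch, here is that integrator applied to a concrete force (a spring, $F(x)=-x$, chosen by me purely for illustration); symplectic Euler is known to approximately conserve the oscillator's energy, which makes for an easy check:

```python
def symplectic_euler(x, v, force, m=1.0, dt=0.01, steps=1000):
    """Update v first, then move x using the *new* velocity (symplectic Euler)."""
    for _ in range(steps):
        v = v + (dt / m) * force(x)
        x = x + v * dt
    return x, v

# Harmonic oscillator: F(x) = -x, starting at rest with x = 1.
x, v = symplectic_euler(1.0, 0.0, lambda x: -x)
energy = 0.5 * v**2 + 0.5 * x**2  # stays close to the initial energy 0.5
```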
Now, just let $x_t=\theta_t$, $\eta=\Delta t$, and set the external force to be $F(x_t,t) = F(\theta)=\gamma\hat{\nabla}\mathcal{L}(\theta)$.
Notice that as $m\rightarrow 0$, the contribution of $F$ becomes higher (i.e. momentum doesn't matter), whereas if $m\rightarrow \infty$, $v_t$ becomes a constant for all $t$ (nothing can move a huge neutron star from whatever its current trajectory is, right?).
Also, let $p_t = mv_t$, $ \xi = \eta/m $, and $\kappa = \eta\gamma$. Then:
\begin{align}
p_{t+1} &= p_t + \kappa\hat{\nabla}\mathcal{L}(\theta_t) \\
\theta_{t+1} &= \theta_t + \xi p_{t+1}
\end{align}
This is pretty close to the momentum equations above, the main differences being the use of the weighted average (and the sign convention). How can we get something more like that?
Well, let's try to add a friction force, or drag, i.e.:
\begin{align}
\frac{dx_t}{dt} &= v_t \\
\frac{dv_t}{dt} &= \frac{1}{m}\left( F(x_t) - \epsilon v_t \right)
\end{align}
Discretizing this one gives:
$$
v_{t+1} = \left(1-\frac{\Delta t\epsilon}{m}\right)v_t + \frac{\Delta t}{m}F(x_t)
\;\;\;\;\;\&\;\;\;\;\;
x_{t+1} = x_t + \Delta t v_{t+1}
$$
which we can translate into
$$
p_{t+1} = \left(1-\frac{\eta\epsilon}{m}\right)p_t + \eta\gamma\hat{\nabla}\mathcal{L}(\theta_t)
\;\;\;\;\;\&\;\;\;\;\;
\theta_{t+1} = \theta_t + \frac{\eta}{m} p_{t+1}
$$
which we can rewrite as
$$ p_{t+1} = \alpha p_t + (1-\alpha)\hat{\nabla}\mathcal{L}(\theta_{t}) $$
$$ \theta_{t+1} = \theta_t + \beta p_{t+1} $$
where we have set:
$$
\alpha = 1 - \frac{\eta\epsilon}{m},\;\;\;\;\;
\beta = \frac{\eta}{m},\;\;\;\;\;
\gamma = \frac{\epsilon}{m}
$$
(so that $\eta\gamma = \frac{\eta\epsilon}{m} = 1-\alpha$). Notice that, up to the sign of the position update, this last form is identical to the form of momentum at which we first looked! (The sign is just our choice of force direction: taking $F=-\gamma\hat{\nabla}\mathcal{L}$ instead, i.e. a force that pushes downhill, flips $p$ and recovers $\theta_{t+1} = \theta_t - \beta p_{t+1}$ exactly.)
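This identification can be checked numerically. A quick sketch (the constants are arbitrary choices of mine; I use the same sign convention as in the derivation just above, with the position updated by $+\frac{\eta}{m}p_{t+1}$ in both forms):

```python
import numpy as np

eta, eps, m = 0.1, 2.0, 0.5
alpha = 1 - eta * eps / m   # decay factor on p
gamma = eps / m             # so that eta*gamma = 1 - alpha
beta = eta / m

rng = np.random.default_rng(0)
grads = rng.normal(size=50)  # stand-ins for the gradient estimates

# Physics form: p' = (1 - eta*eps/m) p + eta*gamma*g,  theta' = theta + (eta/m) p'
p, theta_phys = 0.0, 1.0
for g in grads:
    p = (1 - eta * eps / m) * p + eta * gamma * g
    theta_phys += (eta / m) * p

# Weighted-average form: q' = alpha*q + (1-alpha)*g,  theta' = theta + beta*q'
q, theta_avg = 0.0, 1.0
for g in grads:
    q = alpha * q + (1 - alpha) * g
    theta_avg += beta * q

# the two trajectories coincide (up to floating point)
```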
I think you might be mixing up a few things here. In general, gradient descent is probably the simplest optimization algorithm there is, and it requires very few assumptions: essentially only that the function is continuously differentiable. However, if you seek to ensure convergence to a global minimum, or you wish to derive the complexity of the algorithm, things get a bit more complicated.
If you make no further assumptions on the function, then gradient descent can only be guaranteed to advance towards a stationary point (a local min/max or a saddle point), if such a point exists. Additional assumptions yield additional guarantees:
- If we assume the function has a Lipschitz gradient, then we can derive an iteration complexity (a bound on how much the algorithm "progresses" from one iteration to the next).
- If we assume the function is convex, then any stationary point it reaches must be a global minimum (for convex functions, every local minimum is global); however, such a minimum need not exist (consider $f(x)=e^x$).
- If we assume the function has bounded level sets, then we are guaranteed it will converge to a stationary point. Bounded level sets can be obtained in various ways, most notably if the domain is compact or if the function is coercive (strong convexity, for instance, implies coercivity).
- Strictly convex functions admit a unique global minimum (while convex functions might have a range of minimizers).
- Strongly convex functions give an improved iteration complexity.
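A toy illustration of the convex and bounded-level-set bullets (my own sketch): gradient descent on $f(x)=x^2$, which has bounded level sets, converges to the minimizer, while on the convex $f(x)=e^x$, which has no minimizer, the iterates just drift off towards $-\infty$:

```python
import math

def gd(x, grad, eta=0.1, steps=500):
    """Plain gradient descent with a fixed step size."""
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

x_quad = gd(5.0, lambda x: 2 * x)        # f(x) = x^2: converges to 0
x_exp = gd(0.0, lambda x: math.exp(x))   # f(x) = e^x: no minimizer, x keeps decreasing
```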
Note that neural networks optimize cost functions that do not satisfy these assumptions; with ReLU activations, for example, they are not even continuously differentiable. So neural networks come with few theoretical guarantees, although in practice training tends to converge to a local minimum or a saddle point, at an unknown rate. Moreover, in practice one uses a variant of gradient descent called stochastic gradient descent (SGD), which processes the data in mini-batches at every iteration, since calculating and storing gradients over the full dataset would be too expensive in terms of the memory and computation required.
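A minimal mini-batch SGD loop for a least-squares problem (a sketch under assumptions of my own: the function name, shapes, and hyperparameters are illustrative, not any framework's API):

```python
import numpy as np

def sgd_least_squares(X, y, eta=0.1, batch_size=8, epochs=200, seed=0):
    """Minimize (1/2n)||Xw - y||^2 using mini-batch gradient estimates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)  # fresh shuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ w - yb) / len(batch)  # batch gradient estimate
            w -= eta * grad
    return w

# Noiseless data generated from a known weight vector: SGD should recover it.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 3))
w_true = np.array([1.0, -2.0, 0.5])
w_hat = sgd_least_squares(X, X @ w_true)
```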
I think that Lipschitzness of $\ell$ (together with differentiability, so that the pointwise derivative exists) is a sufficient condition for that identity to hold. Indeed, for any $i$:
$$\begin{align}\mathbb E\left[\frac{\partial \ell}{\partial w_i}\right] &= \mathbb E\left[\lim_{h\to0}\frac{\ell(w_i+h)-\ell(w_i)}{h}\right]\\ &=\mathbb E\left[\lim_{n\to\infty}\frac{\ell(w_i+1/n)-\ell(w_i)}{1/n}\right] \end{align}$$
Now, if $\ell$ is $L$-Lipschitz, then the sequence of difference quotients $\frac{\ell(w_i+1/n)-\ell(w_i)}{1/n}$ is dominated by the constant $L$, and thus the Dominated Convergence Theorem applies. Hence
$$\begin{align}\mathbb E\left[\frac{\partial \ell}{\partial w_i}\right]&=\mathbb E\left[\lim_{n\to\infty}\frac{\ell(w_i+1/n)-\ell(w_i)}{1/n}\right]\\ &=\lim_{n\to\infty} \mathbb E\left[\frac{\ell(w_i+1/n)-\ell(w_i)}{1/n}\right]\\ &=\lim_{n\to\infty}\frac{\mathbb E[\ell(w_i+1/n)] - \mathbb E[\ell(w_i)]}{1/n}\\ &=\frac{\partial}{\partial w_i}\mathbb E[\ell]\end{align}$$
(the same argument along any sequence $h_n\to0$ gives the full two-sided derivative).
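A quick numerical sanity check of the exchange $\mathbb E[\partial\ell/\partial w_i] = \partial\,\mathbb E[\ell]/\partial w_i$, using a toy quadratic loss of my choosing over a finite sample (a quadratic is not globally Lipschitz, but over a finite sample the exchange is trivially valid, so it still serves as a sanity check):

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0])  # a finite "data distribution" with uniform weights
w = 0.7                         # the point at which we compare the two sides

# E[dl/dw]: average of the pointwise derivatives of l(w; x) = (w - x)^2
mean_of_grads = np.mean(2 * (w - xs))

# d/dw E[l]: central finite difference of the averaged loss
def expected_loss(w):
    return np.mean((w - xs) ** 2)

h = 1e-6
grad_of_mean = (expected_loss(w + h) - expected_loss(w - h)) / (2 * h)
# both sides equal 2 * (w - mean(xs)) = -0.6
```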