I am going through the derivation of a neural network in this lecture PDF, and I am stuck on equation $(21)$.

Note on notation:

- The output of the activation function of layer $j$ is $y_j$
- The weighted sum of the inputs to layer $j$ is $x_j$
- The final label is $t$

I am trying to figure out where

$$\eta \Delta w_{kj} (n-1)$$

is coming from in the final equation $(21)$

$$ \Delta w_{kj}(n) = \alpha \delta_j y_k + \eta \Delta w_{kj} (n-1)$$

The author mentioned that it is a momentum term without really elaborating on it.

I thought $\Delta w_{kj}$ was calculated as follows:

$$\Delta w_{kj} = - \alpha \frac{\partial E}{\partial w_{kj}}$$

For the layer just before the final output layer:

$$\Delta w_{kj} = - \alpha (-(t_j-y_j))y_j(1-y_j)y_k$$

For all other layers:

$$\Delta w_{kj} = - \alpha (\delta_{i}w_{ji}) y_j(1-y_j)y_k$$

So what is the momentum term?

Can someone help me out?

## Best Answer

Since you are asking for intuition, here's the general idea. Imagine that you are using stochastic gradient descent (SGD) to traverse the surface of the loss function. Further imagine that the surface looks like a mountain and that you are starting near the top. The surface has many small crags and lips but is clearly headed down to a valley. As you traverse the surface using SGD, the magnitude and sign of the derivative change often, so the direction and speed of your descent fluctuate a great deal. If you include a momentum term, then instead of relying only on the gradient at each new stopping point, your movement also depends on the size and direction of your previous update. And since that previous update in turn depended on the update before it, each movement depends on the entire history of previous movements. The fluctuations tend to cancel while the consistent downhill component accumulates, so pretty soon you're running down the mountain toward the valley.
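To make the smoothing effect concrete, here is a minimal sketch of the update rule in equation $(21)$. The function name `momentum_updates` and the step values are made up for illustration; `steps` stands in for the plain gradient steps $\alpha \delta_j y_k$, and `eta` is the momentum coefficient $\eta$.

```python
def momentum_updates(steps, eta):
    """Return the updates dw(n) = step(n) + eta * dw(n-1), starting from dw = 0.

    `steps` are the plain gradient-descent steps, i.e. alpha * delta_j * y_k
    in the notation of equation (21); `eta` is the momentum coefficient.
    """
    updates = []
    previous = 0.0  # assumption: no movement before the first step
    for step in steps:
        previous = step + eta * previous
        updates.append(previous)
    return updates


# Made-up noisy steps: mostly downhill (+1.0) with smaller reversals (-0.5).
noisy_steps = [1.0, -0.5, 1.0, -0.5, 1.0, -0.5]

plain = momentum_updates(noisy_steps, eta=0.0)     # no momentum: updates flip sign
smoothed = momentum_updates(noisy_steps, eta=0.9)  # momentum: every update stays positive

print(plain)
print(smoothed)
```

With `eta=0.0` the function returns the steps unchanged, so the direction reverses on every other update. With `eta=0.9` the reversals are absorbed by the accumulated movement, and the updates keep pointing the same way.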

Notice that the further back in time an update is, the less it matters for your current update. If you expand the momentum term at time $t$, the contribution of the update from time $t-k$ is

$$\eta^{k}\,\Delta\omega_{t-k}$$

where $\eta\in[0,1]$ is the momentum coefficient (the $\eta$ in your equation $(21)$; $\alpha$ there is the learning rate) and $\Delta\omega_{t-k}$ is the weight update at time $t-k$. Here $t$ indexes update steps, like $n$ in your equation, not the label.
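Unrolling the recursion in equation $(21)$ shows how older steps are discounted. As a sketch, write $g(n) = \alpha \delta_j y_k$ for the plain gradient step at iteration $n$ (the shorthand $g$ is introduced here, it is not from the lecture notes) and assume there is no update before the first step, $\Delta w_{kj}(0) = 0$:

$$\begin{aligned}
\Delta w_{kj}(n) &= g(n) + \eta\,\Delta w_{kj}(n-1) \\
&= g(n) + \eta\,g(n-1) + \eta^{2}\,\Delta w_{kj}(n-2) \\
&= \sum_{m=0}^{n-1} \eta^{m}\, g(n-m)
\end{aligned}$$

So the step taken $m$ iterations ago is weighted by $\eta^{m}$. If the gradient step were roughly constant, $g(n) \approx g$, and $\eta < 1$, the sum approaches $g/(1-\eta)$. For example, $\eta = 0.9$ would amplify a consistent step about tenfold, while steps that alternate in sign largely cancel.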