Solved – Gradient decay in neural networks

classification, gradient descent, machine learning, neural networks

I read that in traditional feed-forward neural nets the gradients in the early layers decay very quickly and that this is 'a bad thing'. But I don't understand why. Can someone please explain what that means, or at least point me to something I can read to understand it?

Best Answer

Think about the derivative of a cascade of functions,

$$\frac{\partial}{\partial \theta} f(g(h(\theta))) = \frac{\partial}{\partial g} f(g(h(\theta))) \cdot \frac{\partial}{\partial h} g(h(\theta)) \cdot \frac{\partial}{\partial \theta} h(\theta).$$

In a traditional neural network, you have a cascade of linear mappings and point-wise nonlinearities. If your nonlinearity is the logistic sigmoid, derivative factors like $\frac{\partial}{\partial g} f$ are smaller than 1 (the sigmoid derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ is at most $1/4$), so when you multiply many of these together you get a very small gradient. The derivatives of parameters in early layers contain more such factors than those in later layers, and hence tend to be smaller.
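As a quick numerical illustration (not part of the original answer), here is a minimal NumPy sketch of a chain of scalar sigmoid units with hypothetical random weights; the chain-rule product of per-layer derivatives shrinks as you move back toward the input, since each sigmoid factor is at most $1/4$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_layers = 10
weights = rng.normal(size=n_layers)  # one hypothetical scalar weight per layer

# Forward pass: each "layer" is a scalar linear map followed by a sigmoid.
activations = [0.5]                  # arbitrary scalar input
for w in weights:
    activations.append(sigmoid(w * activations[-1]))

# Backward pass: accumulate the chain-rule product from the output backwards.
# sigmoid'(z) = a * (1 - a) is at most 1/4, so the product keeps shrinking.
grad = 1.0
for i in reversed(range(n_layers)):
    a = activations[i + 1]
    grad *= a * (1.0 - a) * weights[i]          # factor d a_{i+1} / d a_i
    print(f"d output / d a_{i} ~ {grad:+.2e}")  # smallest for the earliest layers
```

Running this prints gradient magnitudes that drop by several orders of magnitude between the last and the first layer.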

This is a bad thing if you are using gradient descent with a single step size: you will either be making tiny updates to the first-layer parameters or huge updates to the last-layer parameters. You could choose a different step size for different groups of parameters, or you could use second-order methods, which automatically rescale the gradients appropriately.
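If you want to try the first remedy, here is a minimal sketch (not from the original answer) using PyTorch optimizer parameter groups; the layer sizes and learning rates are arbitrary placeholders chosen only for illustration. The early layers, whose gradients are smallest, get the largest step size.

```python
import torch
import torch.nn as nn

# A hypothetical sigmoid network; the layer sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(100, 50), nn.Sigmoid(),
    nn.Linear(50, 20), nn.Sigmoid(),
    nn.Linear(20, 1),
)

# Per-group step sizes: larger learning rates for the early (small-gradient)
# layers, smaller ones for the later (large-gradient) layers.
optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 1e-1},  # first layer
        {"params": model[2].parameters(), "lr": 1e-2},
        {"params": model[4].parameters(), "lr": 1e-3},  # last layer
    ],
    lr=1e-2,  # default for any group that does not set its own rate
)
```

In practice, the per-group rates would have to be tuned by hand, which is part of why architectures and nonlinearities that avoid the decay in the first place (or second-order and adaptive methods) are usually preferred.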
