Solved – what is the vanishing gradient problem?

deep learning, gradient, machine learning, neural networks

I have seen the term "vanishing gradient" many times in the deep learning literature. What is it? The gradient with respect to what variable? The input variables or the hidden units?

Does it mean the gradient vector is all zero? Or that the optimization is stuck at a local minimum / saddle point?

Best Answer

If you do not carefully choose the range of the initial weight values, and if you do not control the range of the weights during training, vanishing gradients can occur; they are one of the main barriers to training deep networks. Neural networks are trained with the gradient descent update: $$w^{new} := w^{old} - \eta \frac{\partial L}{\partial w}$$ where $L$ is the loss of the network on the current training batch. Clearly, if $\frac{\partial L}{\partial w}$ is very small, learning will be very slow, since each change to $w$ is very small. So if the gradients vanish, learning becomes extremely slow.
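To make the update rule concrete, here is a minimal sketch on a toy one-dimensional loss $L(w) = (w - 3)^2$, so $\frac{\partial L}{\partial w} = 2(w - 3)$. The loss, learning rate, and function names are arbitrary choices for illustration, not part of the original answer.

```python
# Toy illustration of w_new = w_old - eta * dL/dw on L(w) = (w - 3)^2.
def gradient_descent_step(w, grad, lr=0.1):
    """One gradient descent update with learning rate lr (eta)."""
    return w - lr * grad

w = 0.0
for step in range(5):
    grad = 2 * (w - 3.0)                # dL/dw for the toy loss
    w = gradient_descent_step(w, grad)
    # If grad were tiny (a "vanished" gradient), w would barely move per step.
    print(f"step {step}: grad = {grad:+.4f}, w = {w:.4f}")
```

When the gradient is on the order of, say, $10^{-8}$, the same update moves $w$ by a negligible amount, which is exactly why vanishing gradients stall learning.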

The cause of vanishing gradients is that, during backpropagation, the gradients of the early layers (the layers near the input) are obtained by multiplying together the gradients of the later layers (the layers near the output) via the chain rule. So if, for example, the per-layer factors are less than one in magnitude, their product shrinks toward zero very quickly as the network gets deeper.
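For intuition, here is a hedged sketch (assuming PyTorch is available; the depth, layer width, initialization, and loss are arbitrary choices for illustration) that prints the gradient norm of each linear layer in a deep sigmoid network. Because the sigmoid's derivative is at most 0.25, the norms typically drop by orders of magnitude toward the input layer.

```python
# Sketch: gradient norms shrink toward the early layers of a deep sigmoid net.
import torch
import torch.nn as nn

torch.manual_seed(0)

depth = 20
layers = []
for _ in range(depth):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]   # sigmoid derivative <= 0.25
net = nn.Sequential(*layers, nn.Linear(32, 1))

x = torch.randn(8, 32)
loss = net(x).pow(2).mean()    # arbitrary scalar loss, just to backpropagate
loss.backward()

# Print the gradient norm of each Linear layer's weights, input -> output.
for i, layer in enumerate(net):
    if isinstance(layer, nn.Linear):
        print(f"layer {i:2d}: grad norm = {layer.weight.grad.norm().item():.3e}")
```

Running this, the layers closest to the input show gradient norms many orders of magnitude smaller than those near the output, which is the vanishing gradient pattern described above.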

With this explanation, the answers to your questions are:

  • The gradient is the gradient of the loss with respect to each trainable parameter (the weights and biases), not with respect to the input variables or the hidden units.
  • A vanishing gradient does not mean the gradient vector is all zero (barring numerical underflow); it means the gradients are so small that learning will be very slow. This is also different from being stuck at a local minimum or saddle point, where the gradient is small because of the shape of the loss surface rather than because of the depth of the network.