Gradient Descent – Understanding Vanishing Gradient and Gradient Zero in Neural Networks

backpropagation, gradient descent, machine learning, neural networks

There is a well-known problem, the vanishing gradient, in backpropagation training of feedforward neural networks (FNNs) (here we don't consider the vanishing gradient in recurrent neural networks).

I don't understand why a vanishing gradient does not mean a zero gradient, namely the optimal solution we want. I saw some answers saying that a vanishing gradient is not exactly a zero gradient, it just means that the parameter updates are very slow. However, in gradient descent we don't aim for an exactly zero gradient anyway; we stop when the parameters change by less than $\epsilon,$ which seems to be the same situation as a vanishing gradient.

So can anyone give me a clear answer?

Best Answer

Relatively low gradient doesn't always mean we have reached a critical point

Having a low value in a component of the gradient doesn't necessarily mean that we are close to a critical point for that parameter. It only means that the function will change little if we make an update based solely on that gradient component.

For example, consider the function $f(w_1,w_2) = 0.001w_1^2 + w_2^2$: at a point with similar coordinates, the gradient component in the $w_2$ direction is $1000$ times bigger than the one in the $w_1$ direction.

So in that function (which we can interpret as our cost function), if we initialize both parameters to a similar value, this leads to a much slower improvement in the $w_1$ direction, as we can see in the next contour plot for a learning rate of $0.5$:

[Contour plot: gradient descent trajectory on $f(w_1, w_2)$ with learning rate $0.5$]

As we can see, even though we are far from the minimum at $(0, 0)^T$, the improvements in $w_1$ are very small, so we need a lot of iterations to reach the minimum along $w_1$; it only got there after $1000$ iterations! So if we had initialized the algorithm at a farther point, it wouldn't have reached the minimum at all.

So now we understand that even if we have a low value of the gradient, our parameters don't necessarily have to be close to the values that reach the minimum (or, in general, a critical point) of the cost function. Instead, their learning may be severely slowed down.
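As a quick check of this behaviour, here is a minimal sketch (plain gradient descent in Python/NumPy on the same cost $f(w_1,w_2)=0.001w_1^2+w_2^2$ with learning rate $0.5$; the starting point $(5, 5)$ is an arbitrary assumption):

```python
import numpy as np

def grad(w):
    # Gradient of f(w1, w2) = 0.001*w1**2 + w2**2
    return np.array([0.002 * w[0], 2.0 * w[1]])

w = np.array([5.0, 5.0])   # arbitrary starting point with similar coordinates
lr = 0.5                   # learning rate used in the contour plot above

for it in range(1001):
    w = w - lr * grad(w)
    if it in (0, 10, 100, 1000):
        print(f"iter {it:4d}: w1 = {w[0]:.4f}, w2 = {w[1]:.6f}")
```

With this setup $w_2$ hits $0$ after the very first update, while $w_1$ only shrinks by a factor of $0.999$ per step and is still around $1.8$ after $1000$ iterations, even though its gradient component was "small" the whole time.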

In practice we can avoid this undesired behaviour by using modifications of the gradient descent method, for example Adagrad. With this method, each component of the gradient is normalized based on the current and previous gradient magnitudes in that direction, so we effectively get a specific learning rate for each dimension.
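A minimal sketch of this idea on the same cost function (a hand-rolled Adagrad update; the $\epsilon$ term and the starting point are assumptions, not taken from above):

```python
import numpy as np

def grad(w):
    # Gradient of f(w1, w2) = 0.001*w1**2 + w2**2
    return np.array([0.002 * w[0], 2.0 * w[1]])

w = np.array([5.0, 5.0])
lr = 0.5
eps = 1e-8
G = np.zeros(2)            # running sum of squared gradients, one entry per dimension

for it in range(1001):
    g = grad(w)
    G += g ** 2                          # accumulate squared gradient per dimension
    w = w - lr * g / (np.sqrt(G) + eps)  # per-dimension effective learning rate
    if it in (0, 10, 100, 1000):
        print(f"iter {it:4d}: w1 = {w[0]:.4f}, w2 = {w[1]:.4f}")
```

With the normalization, the very first update moves both coordinates by the same amount ($0.5$), so $w_1$ no longer lags far behind $w_2$.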


Reasoning with backpropagation using a simple NN

To see why these smaller gradient components can also appear in neural networks, we can use a simple NN consisting of only one neuron per layer, like the next one:

[Diagram: a simple NN with one neuron per layer, connected in a chain]

As we know, the element of the gradient given by the derivative of the cost function, $C$, with respect to a weight $w^l$ of the layer $l$, in a fully connected NN is given by the left term: $$\frac{\partial C}{\partial w^l}= \delta^l (a^{l-1})^T \,\,\,\,\,\,\,\,\xrightarrow[]{\text{in our simple NN}}\,\,\,\,\,\,\,\, \frac{\partial C}{\partial w^l}=\delta^l a^{l-1}$$

Where $\delta^l$ is the "error" term ($\partial C/\partial z^l$), and $a^{l-1}$ represents the vector of activations of the neurons from the previous layer ($l-1$). Note that in the simple NN presented above we don't need to transpose $a^{l-1}$ as it is a scalar (one neuron per layer $\rightarrow$ one activation per layer).

Typically, we can easily calculate the "error" term of the last layer ($\delta^L$), but we don't know how to calculate it for the previous layers, so we make use of backpropagation:

$$\delta^l = \left((w^{l+1})^T\,\,\delta^{l+1}\right)\odot g'(z^l) \,\,\,\,\,\,\,\,\xrightarrow[]{\text{in our simple NN}}\,\,\,\,\,\,\,\, \delta^l=w^{l+1} \,g'(z^l) \,\,\delta^{l+1} $$

Where $g'(z^l)$ represents the derivative of the neuron's activation function evaluated at the term $z^l$.

So, for any layer $l$, how is the term $\partial C/ \partial w^l$ computed? Using the previous reasoning for the simple NN, we now know that:

$$ \begin{align} \delta^{l} &= w^{l+1} \,g'(z^{l}) \,\,\color{blue}{\delta^{l+1}}\\ \\ & = w^{l+1} \,g'(z^{l}) \,\,\color{blue}{w^{l+2} \,g'(z^{l+1}) \,\,\delta^{l+2}}\\ \\ &= w^{l+1}\color{blue}{w^{l+2}...w^{L}}\,\,g'(z^{l})\color{blue}{g'(z^{l+1})...g'(z^{L-1})\,\,\delta^L} \end{align} $$ Where the blue terms are equivalent to $\delta^{l+1}$.

As we saw earlier, this term $\delta^l$ multiplied by the activation of the neuron from the previous layer $a^{l-1}$, gives us our desired $\partial C/\partial w^l$:

$$ \frac{\partial C}{\partial w^l} = \delta^{l}a^{l-1} = w^{l+1}\color{blue}{w^{l+2}...w^{L}}\,\,g'(z^{l})\color{blue}{g'(z^{l+1})...g'(z^{L-1})\,\,\delta^L} \,\,a^{l-1}$$

So now we can see clearly that the update made on a weight $w^l$ depends directly on the values of all the weights and activation-function derivatives of the subsequent layers.

This means that, for any value of $w^l$ (which could be very far from the optimum, like in the situation plotted at the beginning), its component of the gradient may tend to zero if any of the weights or activation-function derivatives $g'(z)$ (or a combination of them) of the subsequent layers tend to zero. This undesired effect, as you said in the question, is known as the vanishing gradient.
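To make this concrete, here is a minimal numerical sketch of the chain above (a toy network with one sigmoid neuron per layer and a squared-error cost; the depth, weights, input and target are arbitrary assumptions chosen only to illustrate the effect):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # derivative of the sigmoid, bounded by 0.25

np.random.seed(0)
L = 10                                   # number of layers (one neuron each)
w = np.random.uniform(-1, 1, size=L)     # one weight per layer
a, activations, zs = 1.0, [1.0], []      # a^0 = input

# Forward pass: z^l = w^l * a^{l-1},  a^l = g(z^l)
for l in range(L):
    z = w[l] * a
    a = sigmoid(z)
    zs.append(z)
    activations.append(a)

# Backward pass with C = 0.5*(a^L - y)^2, so delta^L = (a^L - y) * g'(z^L)
y = 0.0
delta = (a - y) * sigmoid_prime(zs[-1])
for l in reversed(range(L)):
    dC_dw = delta * activations[l]       # dC/dw^l = delta^l * a^{l-1}
    print(f"layer {l + 1:2d}: |dC/dw| = {abs(dC_dw):.2e}")
    if l > 0:
        delta = w[l] * sigmoid_prime(zs[l - 1]) * delta   # delta^{l-1} = w^l g'(z^{l-1}) delta^l
```

Since every backward step multiplies by a weight and by $g'(z)\le 0.25$ (the maximum of the sigmoid's derivative), the printed magnitudes of $\partial C/\partial w^l$ shrink rapidly as we move towards the earlier layers, even though none of those weights is anywhere near a minimum.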

Now we understand that even if a weight presents a value of $\partial C/\partial w^l$ close to zero, this doesn't mean that we have reached a critical point for that weight. What's more, the learning of these parameters will slow down significantly, because their updates are proportional to their respective components of the gradient. So they may get stuck at a value far from a minimum!

Finally, note that this undesired effect becomes more pronounced as the number of subsequent layers grows.
