Solved – Why can RNNs with LSTM units also suffer from “exploding gradients”

backpropagation, lstm, neural-networks, recurrent-neural-network

I have a basic understanding of how RNNs work and, in particular, of RNNs with LSTM units. I have a pictorial idea of the architecture of an LSTM unit, that is, a cell and a few gates, which regulate the flow of values.

However, apparently, I haven't fully understood how the LSTM solves the "vanishing and exploding gradients" problem, which occurs when training a conventional RNN with back-propagation through time (BPTT). I haven't had the opportunity to read the papers to fully understand the math.

This answer gives a brief explanation of how RNNs with LSTM units solve the "vanishing gradients" problem. Mathematically, the reason seems to be the existence of a derivative which does not vanish, i.e. does not tend to zero. Consequently, the author states, "there is at least one path where the gradient does not vanish". IMHO, this explanation is a bit vague.

Meanwhile, I was reading the paper Sequence to Sequence Learning with Neural Networks (by Ilya Sutskever, Oriol Vinyals, Quoc V. Le), and, in section "3.4 Training details" of that paper, it is stated:

Although LSTMs tend to not suffer from the vanishing gradient problem, they can have exploding gradients.

I have always thought that RNNs with LSTM units solve both the "vanishing" and "exploding gradients" problems, but, apparently, RNNs with LSTM units also suffer from "exploding gradients".

Intuitively, why is that? Mathematically, what are the reasons?

Best Answer

A very short answer:

The LSTM decouples the cell state (typically denoted by $c$) from the hidden layer/output (typically denoted by $h$), and performs only additive updates to $c$, which makes memories in $c$ more stable. Thus the gradient that flows through $c$ is preserved and hard to vanish (and therefore the overall gradient is hard to vanish). However, other paths may cause gradient explosion.
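To make the short answer slightly more concrete (a minimal sketch in one common notation, where $z^t$ is the block input, $i^t$ and $f^t$ are the input and forget gates, and $\odot$ is element-wise multiplication; symbols vary across papers), note that the cell update is additive, so along the direct $c^{t-1} \rightarrow c^{t}$ path the Jacobian is just a diagonal matrix of gate values, with no recurrent weight matrix in it:

$$c^{t} = f^{t} \odot c^{t-1} + i^{t} \odot z^{t}, \qquad \left.\frac{\partial c^{t}}{\partial c^{t-1}}\right|_{\text{direct path}} = \operatorname{diag}(f^{t}).$$

Contrast this with a vanilla RNN, $h^{t} = \tanh(W h^{t-1} + U x^{t})$, where every back-propagation step multiplies by $\operatorname{diag}\!\left(1-(h^{t})^{2}\right) W$, so the same weight matrix $W$ enters the product again and again. The LSTM's other paths (through the gates, which depend on $h^{t-1}$) do contain recurrent weight matrices, and those are the paths that can still explode, as detailed below.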


A more detailed answer with mathematical explanation:

Let's review the CEC (Constant Error Carousel) mechanism first. The CEC says that, from time step $t$ to $t+1$, if the forget gate is 1 (there is no forget gate in the original LSTM paper, so this is always the case there), the gradient $dl/dc^{t}$ can flow back without change. Following the BPTT formulae in the paper LSTM: A Search Space Odyssey, Appendix A.2 ($y$ in that paper is $h$ in other literature), the CEC flow corresponds to the equation $\delta c^t = \dots + \delta c^{t+1} \odot f^{t+1}$. When $f^{t+1}$ is close to 1, $\delta c^{t+1}$ accumulates into $\delta c^t$ losslessly.
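To see the CEC numerically, here is a tiny, hypothetical sketch (the function name and values are made up for illustration) that keeps only the $\delta c^{t+1} \odot f^{t+1}$ term of the full BPTT equation and back-propagates a cell-state gradient through many steps:

```python
import numpy as np

# Keep only the CEC term of the BPTT equation:
#     delta_c[t] = delta_c[t+1] * f[t+1]
# (the other additive terms are dropped for this illustration).
def cec_gradient_norm(forget_gate_value, T=100, dim=4):
    delta_c = np.ones(dim)               # gradient arriving at the last step
    f = np.full(dim, forget_gate_value)  # constant forget-gate activation
    for _ in range(T):
        delta_c = delta_c * f            # element-wise product, no weight matrix
    return np.linalg.norm(delta_c)

print(cec_gradient_norm(1.0))   # forget gate at 1: norm preserved (2.0)
print(cec_gradient_norm(0.9))   # forget gate below 1: norm shrinks to ~5e-5
```

With the forget gate at 1 the norm of $\delta c$ is unchanged after 100 steps; at 0.9 it shrinks by more than four orders of magnitude, which is exactly the vanishing behaviour the CEC avoids.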

However, the LSTM is more than the CEC. Apart from the CEC path from $c^{t}$ to $c^{t+1}$, other paths exist between two adjacent time steps, for example $y^t \rightarrow o^{t+1} \rightarrow y^{t+1}$. Walking the back-propagation process over two steps along this path gives $\delta y^t \leftarrow R^T_o \delta o^{t+1} \leftarrow \delta y^{t+1} \leftarrow R^T_o \delta o^{t+2}$: $R^T_o$ is multiplied in twice, just as the recurrent weight matrix is in a vanilla RNN, which may cause gradient explosion. Similarly, the paths through the input gate, forget gate and block input are also capable of causing gradient explosion, due to the repeated multiplication by the matrices $R^T_i$, $R^T_f$, $R^T_z$.
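The explosion along the $y^t \rightarrow o^{t+1} \rightarrow y^{t+1}$ path can be illustrated in the same spirit (again a hypothetical sketch: a rescaled random matrix stands in for $R_o$, and the gate-derivative factors are omitted), by repeatedly multiplying a gradient by $R^T_o$ and watching its norm:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Hypothetical stand-in for the recurrent output-gate matrix R_o,
# rescaled so that its spectral radius is 1.5 (i.e. greater than 1).
R_o = rng.normal(size=(dim, dim))
R_o *= 1.5 / np.max(np.abs(np.linalg.eigvals(R_o)))

# Back-propagate a gradient along the y -> o -> y path for 50 steps.
# Each step multiplies by R_o^T (gate-derivative factors omitted), so the
# norm grows roughly like (spectral radius)**steps, as in a vanilla RNN.
delta_y = np.ones(dim)
for _ in range(50):
    delta_y = R_o.T @ delta_y

print(np.linalg.norm(delta_y))   # on the order of 1.5**50 -- exploded
```

Because the same recurrent matrix enters the product at every step, the norm grows roughly like (spectral radius)$^T$ whenever the spectral radius exceeds 1, while the element-wise $\odot\, f^{t+1}$ of the CEC path involves no weight matrix and its factors are bounded by 1, so that path cannot blow up in this way.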

Reference:

K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. CoRR, abs/1503.04069, 2015.