Solved – If we primarily use LSTMs over RNNs to solve the vanishing gradient problem, why can’t we just use ReLUs/leaky ReLUs with RNNs instead?

Tags: artificial intelligence, lstm, machine learning, neural networks, recurrent neural network

RNNs as in: Recurrent Neural Networks

LSTMs as in: Long Short-Term Memory units

ReLU as in: Rectified Linear Units

Leaky ReLU as in: Modified ReLUs that don't "die" on negative inputs, because they keep a small nonzero slope there.

In practice, machine learning practitioners rarely use vanilla RNNs, citing the vanishing gradient problem as the reason: gradients all but die off after relatively few time steps, because backpropagation through time repeatedly multiplies small numbers together, which makes training essentially impossible. LSTMs are known to solve this problem through their more complex architecture, which lets gradient contributions accumulate additively rather than multiplicatively; the repeated multiplication is the culprit behind the vanishing gradient.
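To make the multiplicative-versus-additive point concrete, here is a toy numpy sketch (my own illustration, not part of the original question): the same per-step factors shrink to almost nothing when multiplied across time steps, but stay well-behaved when they merely add up, which is roughly what the LSTM's additive cell-state path buys you.

```python
# Toy illustration: why repeated multiplication shrinks gradients.
# In a vanilla RNN trained with backpropagation through time, the gradient at
# step t is scaled by a product of per-step factors; if those factors are
# typically < 1, the product decays exponentially with the number of steps.
import numpy as np

rng = np.random.default_rng(0)
T = 50                                   # number of time steps
factors = rng.uniform(0.5, 0.9, size=T)  # per-step gradient scaling, all < 1

multiplicative = np.cumprod(factors)     # vanilla-RNN-like: factors multiply
additive = np.cumsum(factors)            # LSTM-like: contributions add up

print(f"after {T} steps, multiplicative gradient scale: {multiplicative[-1]:.2e}")
print(f"after {T} steps, additive gradient scale:       {additive[-1]:.2f}")
```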

However, with feedforward ANNs, ReLUs are known to address this problem well, since their gradient is either "off" (0) or "on" (1). Leaky ReLUs keep a small gradient instead of switching "off" entirely. Both help because their gradients do not saturate.
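For reference, here is a quick numerical sketch (my own, with an arbitrary sample of inputs) of what "the gradients do not saturate" means: a sigmoid's derivative collapses toward zero for large positive or negative inputs, while ReLU and leaky ReLU derivatives stay at 0/1 or a small constant slope.

```python
# Saturating activations have near-zero gradients for large |x|, while
# ReLU-family gradients stay at 0/1 (or a small constant slope for leaky ReLU).
import numpy as np

x = np.array([-10.0, -1.0, 0.5, 10.0])

sigmoid = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = sigmoid * (1.0 - sigmoid)        # ~0 at both extremes
relu_grad = (x > 0).astype(float)               # exactly 0 or 1
leaky_relu_grad = np.where(x > 0, 1.0, 0.01)    # 0.01 instead of 0 ("doesn't die")

print("sigmoid grad:   ", np.round(sigmoid_grad, 5))
print("ReLU grad:      ", relu_grad)
print("leaky ReLU grad:", leaky_relu_grad)
```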

So, why do we need a model as complex as the LSTM when an RNN with ReLUs should be able to solve the problem? Is it just that LSTMs perform much better in practice, and we don't have a good explanation of why?

Best Answer

I think there's some confusion here. The reason you get vanishing gradients in feedforward neural networks (with, say, softmax) is wholly different from the reason in RNNs. With feedforward networks, gradients vanish because most initial conditions push your outputs to the far left or far right of the saturating softmax layer, where the gradient is vanishingly small. In general it's difficult to select proper initial conditions, so people opted for leaky ReLUs, which don't have this problem.

With RNNs, by contrast, the problem is that you repeatedly apply the same recurrent transformation at every time step, which tends to cause either exponential blowup or exponential shrinkage of the gradient. See this paper, for example:

On the difficulty of training recurrent neural networks: https://arxiv.org/abs/1211.5063
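As a toy demonstration of that blowup/shrinkage (my own sketch, not code from the paper): backpropagation through time multiplies the gradient by roughly the same recurrent Jacobian at every step, so its norm behaves like powers of that matrix, vanishing when the weights are small and exploding when they are large.

```python
# Repeatedly multiplying by the (linear part of the) recurrent Jacobian makes
# the backpropagated gradient norm shrink or grow exponentially with depth in
# time, depending on the scale of the recurrent weights.
import numpy as np

rng = np.random.default_rng(1)

def grad_norm_after(T, scale):
    W = scale * rng.standard_normal((32, 32)) / np.sqrt(32)  # toy recurrent Jacobian
    g = np.ones(32)                                          # gradient at the last step
    for _ in range(T):
        g = W.T @ g                                          # one BPTT step (linear part)
    return np.linalg.norm(g)

print("small weights -> vanishing gradient norm:", grad_norm_after(50, scale=0.3))
print("large weights -> exploding gradient norm:", grad_norm_after(50, scale=2.0))
```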

The suggestions of the above paper are: if the gradient is too large, clip it to a smaller value; if the gradient is too small, regularize it with a soft constraint so that it does not vanish.
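A minimal sketch of the first remedy, gradient clipping by norm (a hypothetical helper for illustration, not code from the paper): if the gradient norm exceeds a threshold, rescale the gradient so its norm equals the threshold before taking the update step.

```python
# Gradient clipping by norm: rescale an oversized gradient so its norm equals
# the chosen threshold, leaving smaller gradients untouched.
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])              # "exploding" gradient with norm 50
print(clip_by_norm(g, max_norm=5.0))     # rescaled to norm 5: [ 3. -4.]
```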

There's a lot of research on LSTMs, and plenty of theories on why LSTMs tend to outperform RNNs. Here's a nice explanation: http://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html#fnref2