Have you looked into RMSProp? Take a look at this set of slides from Geoff Hinton:
Overview of mini-batch gradient descent
Specifically page 29, entitled 'rmsprop: A mini-batch version of rprop', though it's probably worth reading through the full set to get a fuller picture of the related ideas.
Also related is Yann LeCun's No More Pesky Learning Rates
and Brandyn Webb's SMORMS3.
The main idea is to look at the sign of the gradient and whether it flip-flops from step to step. If the sign is consistent, you want to keep moving in that direction, and the step you just took must have been OK (provided it isn't vanishingly small), so there are ways of controlling the step size that keep it sensible while being largely independent of the actual gradient magnitude.
So the short answer to handling vanishing or exploding gradients is simple: don't use the gradient's magnitude!
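A minimal sketch of the sign-based rprop rule, assuming NumPy arrays for the parameters and illustrative step-size bounds (none of these names come from the slides):

```python
import numpy as np

def rprop_step(params, grads, prev_grads, step_sizes,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=1.0):
    """One rprop-style update: the per-weight step size adapts to whether the
    gradient sign is consistent, and the gradient magnitude is never used."""
    agree = np.sign(grads) * np.sign(prev_grads)
    # Consistent sign -> grow the step; flipped sign -> shrink it.
    step_sizes = np.where(agree > 0, step_sizes * eta_plus, step_sizes)
    step_sizes = np.where(agree < 0, step_sizes * eta_minus, step_sizes)
    step_sizes = np.clip(step_sizes, step_min, step_max)  # keep the step sensible
    # Move against the gradient sign by the bounded step.
    params = params - np.sign(grads) * step_sizes
    return params, step_sizes
```

RMSProp achieves the same magnitude-independence in a mini-batch-friendly way: rather than adapting a per-weight step directly, it divides each gradient by a running RMS of its recent values.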
A very short answer:
LSTM decouples the cell state (typically denoted by $c$) from the hidden layer/output (typically denoted by $h$), and only does additive updates to $c$, which makes memories in $c$ more stable. Thus the gradient flowing through $c$ is preserved and hard to vanish (and therefore the overall gradient is hard to vanish). However, other paths may still cause gradient explosion.
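A minimal sketch of one LSTM step, using the usual gate names and treating the weights as given (the variable names here are illustrative, not from any particular paper); the key point is the purely additive form of the $c$ update:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, R, b):
    """One LSTM step. W (4H x D), R (4H x H) and b (4H,) stack the parameters
    for the input gate, forget gate, cell candidate and output gate."""
    s = W @ x + R @ h_prev + b
    i, f, g, o = np.split(s, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    # Additive cell-state update: the Jacobian dc_t/dc_{t-1} is just diag(f),
    # with no recurrent weight matrix on this path.
    c = f * c_prev + i * g
    # The hidden state/output is read out from c through the output gate.
    h = o * np.tanh(c)
    return h, c
```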
A more detailed answer with mathematical explanation:
Let's review the CEC (Constant Error Carousel) mechanism first. CEC says that, from time step $t$ to $t+1$, if the forget gate is 1 (there is no forget gate in the original LSTM paper, so this is always the case there), the gradient $\delta c^{t} = dl/dc^{t}$ can flow without change.
Following the BPTT formulae in the paper LSTM: A Search Space Odyssey, Appendix A.2 ($y$ in the paper is $h$ in other literature), the CEC flow corresponds to the equation $\delta c^t = \dots + \delta c^{t+1} \odot f^{t+1}$. When $f^{t+1}$ is close to 1, $\delta c^{t+1}$ accumulates into $\delta c^t$ losslessly.
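Unrolling that recursion over $k$ steps and keeping only the CEC term (the '$\dots$' paths are dropped here) makes the argument explicit:

$$\delta c^{t} \;=\; \delta c^{t+k} \odot \bigl( f^{t+1} \odot f^{t+2} \odot \dots \odot f^{t+k} \bigr) \;+\; (\text{terms from non-CEC paths}).$$

If every forget gate stays close to 1, the elementwise product stays close to 1, so the gradient reaching $c^t$ along this path neither vanishes nor explodes, no matter how large $k$ is.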
However, LSTM is more than the CEC. Apart from the CEC path from $c^{t}$ to $c^{t+1}$, other paths do exist between two adjacent time steps, for example $y^t \rightarrow o^{t+1} \rightarrow y^{t+1}$. Walking through the backpropagation process over 2 steps, we have $\delta y^t \leftarrow R^T_o \delta o^{t+1} \leftarrow \delta y^{t+1} \leftarrow R^T_o \delta o^{t+2}$, so $R^T_o$ is multiplied in twice along this path, just like in a vanilla RNN, which may cause gradient explosion. Similarly, the paths through the input and forget gates can also cause gradient explosion due to repeated multiplication by the recurrent matrices $R^T_i$, $R^T_f$, $R^T_z$.
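A toy numerical sketch of that repeated $R^T_o$ factor (this ignores the gate-nonlinearity derivatives that also enter on the path, and the recurrent matrix here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 64
# Toy recurrent output-gate matrix whose spectral radius is a bit above 1.
R_o = rng.normal(scale=1.5 / np.sqrt(H), size=(H, H))

delta = rng.normal(size=H)       # error signal arriving at some late time step
norms = []
for _ in range(50):
    delta = R_o.T @ delta        # one R_o^T factor is picked up per time step
    norms.append(np.linalg.norm(delta))

print(norms[0], norms[-1])       # the norm grows geometrically over the 50 steps
```

With a scale small enough that the spectral radius falls below 1, the same loop shrinks the norm instead (the vanishing counterpart); the CEC path avoids both problems because no such matrix sits between $c^t$ and $c^{t+1}$.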
Reference:
K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. CoRR, abs/1503.04069, 2015.
"Skip connections eliminate singularities" by A. Emin Orhan, Xaq Pitkow offers an explanation: residual connections ameliorate singularities in neural networks.
I'm not aware of a layer or batch normalization strategy that can accomplish this.