Solved – State-of-the-art algorithms for the training of neural networks with GRU or LSTM units

Tags: gru, lstm, neural-networks, recurrent-neural-network

I recently read a lot about neural networks with GRU or LSTM units. There are many easy-to-use frameworks such as TensorFlow that do not even require much programming knowledge. Unfortunately, I have never found a good explanation of how the training of these networks actually works. Plain backpropagation probably does not apply directly to gated recurrent networks, or is simply too inefficient for networks with so many parameters to learn.

So my question is:
What are the state-of-the-art algorithms used for the initialization and training of neural networks with GRU or LSTM units? I am not looking for frameworks to use, but for initialization and update equations for the internal parameters.

Best Answer

This article is a good place to start.

"Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network" by Alex Sherstinsky

Because of their effectiveness in broad practical applications, LSTM networks have received a wealth of coverage in scientific journals, technical blogs, and implementation guides. However, in most articles, the inference formulas for the LSTM network and its parent, RNN, are stated axiomatically, while the training formulas are omitted altogether. In addition, the technique of "unrolling" an RNN is routinely presented without justification throughout the literature. The goal of this paper is to explain the essential RNN and LSTM fundamentals in a single document. Drawing from concepts in signal processing, we formally derive the canonical RNN formulation from differential equations. We then propose and prove a precise statement, which yields the RNN unrolling technique. We also review the difficulties with training the standard RNN and address them by transforming the RNN into the "Vanilla LSTM" network through a series of logical arguments. We provide all equations pertaining to the LSTM system together with detailed descriptions of its constituent entities. Albeit unconventional, our choice of notation and the method for presenting the LSTM system emphasizes ease of understanding. As part of the analysis, we identify new opportunities to enrich the LSTM system and incorporate these extensions into the Vanilla LSTM network, producing the most general LSTM variant to date. The target reader has already been exposed to RNNs and LSTM networks through numerous available resources and is open to an alternative pedagogical approach. A Machine Learning practitioner seeking guidance for implementing our new augmented LSTM model in software for experimentation and research will find the insights and derivations in this tutorial valuable as well.

This is a dense document with all of the equations your heart might desire. It would be difficult to reproduce all of the relevant materials here.
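
That said, the forward (inference) equations of the Vanilla LSTM cell are standard and worth stating. In one common notation (the paper's own notation and its proposed extensions differ in the details):

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) &&\text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) &&\text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) &&\text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) &&\text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t &&\text{(cell state update)}\\
h_t &= o_t \odot \tanh(c_t) &&\text{(hidden state)}
\end{aligned}
$$

Training is then backpropagation through time (BPTT): unroll the network over the sequence, apply the chain rule through these equations, and feed the resulting gradients to an optimizer such as SGD, RMSprop, or Adam.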

Another presentation can be found in "A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation" by Gang Chen.

We describe recurrent neural networks (RNNs), which have attracted great attention on sequential tasks, such as handwriting recognition, speech recognition and image to text. However, compared to general feedforward neural networks, RNNs have feedback loops, which makes it a little hard to understand the backpropagation step. Thus, we focus on basics, especially the error backpropagation to compute gradients with respect to model parameters. Further, we go into detail on how error backpropagation algorithm is applied on long short-term memory (LSTM) by unfolding the memory unit.
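
The "unfolding" both papers refer to is just running the cell equations in a loop over time and differentiating through that loop. Here is a minimal NumPy sketch of one Vanilla LSTM step and a short unrolled forward pass; the parameter names (`W_f`, `U_f`, `b_f`, ...) and the small-random-weight initialization are illustrative assumptions, not taken from either paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One forward step of a Vanilla LSTM cell.

    x_t    : input at time t, shape (n_in,)
    h_prev : hidden state h_{t-1}, shape (n_hid,)
    c_prev : cell state c_{t-1}, shape (n_hid,)
    p      : dict of weights W_* (n_hid, n_in), U_* (n_hid, n_hid),
             and biases b_* (n_hid,) for gates f, i, o and candidate g
    """
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])   # forget gate
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])   # input gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])   # output gate
    g_t = np.tanh(p["W_g"] @ x_t + p["U_g"] @ h_prev + p["b_g"])   # candidate cell
    c_t = f_t * c_prev + i_t * g_t                                 # cell state update
    h_t = o_t * np.tanh(c_t)                                       # hidden state
    return h_t, c_t

# Illustrative initialization: small random weights, zero biases,
# with the forget-gate bias nudged toward 1 so the cell remembers by default.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
params = {}
for gate in "fiog":
    params[f"W_{gate}"] = rng.normal(0.0, 0.1, (n_hid, n_in))
    params[f"U_{gate}"] = rng.normal(0.0, 0.1, (n_hid, n_hid))
    params[f"b_{gate}"] = np.zeros(n_hid)
params["b_f"] += 1.0

# "Unrolling": apply the same step (same parameters) at every time step.
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):   # a length-5 input sequence
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)                  # (8,) (8,)
```

Backpropagation through time is the chain rule applied backwards through this loop: because the same parameters are reused at every step, each parameter's gradient is a sum of per-time-step contributions, which is what Chen's tutorial works through in detail.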

Also, if you're unfamiliar with backpropagation, we have a number of threads on the topic.

Regarding GRUs, I'm not aware of a similar paper. The supposed promise of GRUs was comparable performance to LSTMs with fewer parameters and less computation; in practice, results are mixed. For a comparison of LSTMs and GRUs, see Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling."
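
For completeness, the GRU update equations in one common convention (papers differ on whether $z_t$ weights the previous state or the candidate, so treat the form below as one variant rather than the definitive one) are:

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) &&\text{(update gate)}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) &&\text{(reset gate)}\\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) &&\text{(candidate state)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t &&\text{(state update)}
\end{aligned}
$$

As with the LSTM, there is no separate training algorithm: gradients are obtained by backpropagation through time over the unrolled recurrence and passed to a standard optimizer.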