Solved – What’s the difference between a single output RNN and a MLP whose input data contains all the features of given time steps

lstmneural networksrecurrent neural network

When a RNN/LSTM contains output in each time step, I can understand that the output of current time steps is a function of its historical data.

But when one deals with a RNN that only has an output of the last time step, what's the difference between it and a MLP whose first layer contains all features of all the time steps? e.g.(slide 15 of this)

For example, when dealing with single output machine learning problem, given time steps = 4 and 2 input features in each time step, what's difference between it and a 8 inputs MLP?

Note that, for convenience, I put all the neurons of the hidden layer into a single hidden layer box.

Any help will be appreciated.

Best Answer

The way I like to describe this is that with RNN your net "understands" that your inputs are the same thing (say frame of video) and applies similar transformation to all of them - for instance if you are inputting a video input at a given timestep is a frame of the video and is separate from the previous frames. If you input all of the frames together your net might develop connections among unrelated inputs that are close to each other in the input (for instance first pixels of one frame and last pixels of the previous ones).

Another great benefit of RNNs is that it requires less trainable connections and allows for variable length of the input (videos can be several seconds or hours long) while FNN will require some heavy data manipulation and will not perform well if most of the inputs are 0s

Related Solutions

Solved – What happens when we feed a 2D matrix to a LSTM layer

1) X are your inputs, if you have 99 timesteps, then you have 99 vectors of size 13 each. Hence your input to each timestep is a vector that is of size 13. You will need a starting hidden state, unless you have a reason to do otherwise your beginning hidden state can be all 0's. The size of that vector is a hyperparameter you choose.

2) Keep in mind that there are not 99 LSTM cells, there is only 1 LSTM cell that is re-used 99 times for each timestep. The LSTM cell maintains a hidden state and a cell state within it that it passes forward to the next time step. But there is only 1 set of parameters being learned. Those parameters need to be able to handle all timesteps, conditional on the current input, hidden state, and cell state.

3) The cell state is not an output, however it is passed forward as an input to the next timestep. The hidden state h_t will be passed to the output as well as to the next timestep.

4) I'm not quite sure, I need a reference to the term output_dim.

This is an excellent tutorial on LSTMs: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Solved – RNNs: When to apply BPTT and/or update weights

I'll assume we're talking about recurrent neural nets (RNNs) that produce an output at every time step (if output is only available at the end of the sequence, it only makes sense to run backprop at the end). RNNs in this setting are often trained using truncated backpropagation through time (BPTT), operating sequentially on 'chunks' of a sequence. The procedure looks like this:

Forward pass: Step through the next $k_1$ time steps, computing the input, hidden, and output states.

Compute the loss, summed over the previous time steps (see below).

Backward pass: Compute the gradient of the loss w.r.t. all parameters, accumulating over the previous $k_2$ time steps (this requires having stored all activations for these time steps). Clip gradients to avoid the exploding gradient problem (happens rarely).

Update parameters (this occurs once per chunk, not incrementally at each time step).

If processing multiple chunks of a longer sequence, store the hidden state at the last time step (will be used to initialize hidden state for beginning of next chunk). If we've reached the end of the sequence, reset the memory/hidden state and move to the beginning of the next sequence (or beginning of the same sequence, if there's only one).

Repeat from step 1.

How the loss is summed depends on $k_1$ and $k_2$. For example, when $k_1 = k_2$, the loss is summed over the past $k_1 = k_2$ time steps, but the procedure is different when $k_2 > k_1$ (see Williams and Peng 1990).

Gradient computation and updates are performed every $k_1$ time steps because it's computationally cheaper than updating at every time step. Updating multiple times per sequence (i.e. setting $k_1$ less than the sequence length) can accelerate training because weight updates are more frequent.

Backpropagation is performed for only $k_2$ time steps because it's computationally cheaper than propagating back to the beginning of the sequence (which would require storing and repeatedly processing all time steps). Gradients computed in this manner are an approximation to the 'true' gradient computed over all time steps. But, because of the vanishing gradient problem, gradients will tend to approach zero after some number of time steps; propagating beyond this limit wouldn't give any benefit. Setting $k_2$ too short can limit the temporal scale over which the network can learn. However, the network's memory isn't limited to $k_2$ time steps because the hidden units can store information beyond this period (e.g. see Mikolov 2012 and this post).

Besides computational considerations, the proper settings for $k_1$ and $k_2$ depend on the statistics of the data (e.g. the temporal scale of the structures that are relevant for producing good outputs). They probably also depend on the details of the network. For example, there are a number of architectures, initialization tricks, etc. designed to mitigate the decaying gradient problem.

Your option 1 ('frame-wise backprop') corresponds to setting $k_1$ to $1$ and $k_2$ to the number of time steps from the beginning of the sentence to the current point. Option 2 ('sentence-wise backprop') corresponds to setting both $k_1$ and $k_2$ to the sentence length. Both are valid approaches (with computational/performance considerations as above; #1 would be quite computationally intensive for longer sequences). Neither of these approaches would be called 'truncated' because backpropagation occurs over the entire sequence. Other settings of $k_1$ and $k_2$ are possible; I'll list some examples below.

References describing truncated BPTT (procedure, motivation, practical issues):

Sutskever (2013). Training recurrent neural networks.

Mikolov (2012). Statistical Language Models Based on Neural Networks.

Using vanilla RNNs to process text data as a sequence of words, he recommends setting $k_1$ to 10-20 words and $k_2$ to 5 words

Performing multiple updates per sequence (i.e. $k_1$ less than the sequence length) works better than updating at the end of the sequence

Performing updates once per chunk is better than incrementally (which can be unstable)

Williams and Peng (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories.

Original (?) proposal of the algorithm

They discuss the choice of $k_1$ and $k_2$ (which they call $h'$ and $h$). They only consider $k_2 \ge k_1$.

Note: They use the phrase "BPTT(h; h')" or 'the improved algorithm' to refer to what the other references call 'truncated BPTT'. They use the phrase 'truncated BPTT' to mean the special case where $k_1 = 1$.

Other examples using truncated BPTT:

(Karpathy 2015). char-rnn.

Description and code

Vanilla RNN processing text documents one character at a time. Trained to predict the next character. $k_1 = k_2 = 25$ characters. Network used to generate new text in the style of the training document, with amusing results.

Graves (2014). Generating sequences with recurrent neural networks.

See section about generating simulated Wikipedia articles. LSTM network processing text data as sequence of bytes. Trained to predict the next byte. $k_1 = k_2 = 100$ bytes. LSTM memory reset every $10,000$ bytes.

Sak et al. (2014). Long short term memory based recurrent neural network architectures for large vocabulary speech recognition.

Modified LSTM networks, processing sequences of acoustic features. $k_1 = k_2 = 20$.

Ollivier et al. (2015). Training recurrent networks online without backtracking.

Point of this paper was to propose a different learning algorithm, but they did compare it to truncated BPTT. Used vanilla RNNs to predict sequences of symbols. Only mentioning it here to to say that they used $k_1 = k_2 = 15$.

Hochreiter and Schmidhuber (1997). Long short-term memory.

They describe a modified procedure for LSTMs

Best Answer

Related Solutions

Solved – What happens when we feed a 2D matrix to a LSTM layer

Solved – RNNs: When to apply BPTT and/or update weights

Related Question