Neural Networks – How to Process Sequences Longer Than Memory in LSTM

lstm, neural networks, recurrent neural network

* Note: The premise of my question was incorrect in the first place. My question assumes that an LSTM maintains a separate set of weights for each time step of the memory length it is given as a design parameter. As Sycorax pointed out, only a single set of weights is learned, so an LSTM cell does not care how long a sequence is: it applies the same set of weights to each step in the sequence.

Please don't let my question confuse your understanding of LSTMs.


Terminology:

  • Cell: the LSTM unit containing the input, forget, and output gates, the hidden state hT, and the cell state cT.
  • Hidden units/memory: how far back in time the LSTM is "unrolled". A hidden unit is an instance of the cell at a particular time step.
  • A hidden unit is parameterized by [wT, cT, hT-1]: the gate weights for the current hidden unit, the current cell state, and the previous hidden unit's output, where wT represents the input, output, and forget gate weights.

An LSTM maintains separate gate weights wT for each hidden unit. This way it can treat different points in time of a sequence differently. (see *Note above)

Let's say an LSTM has 3 hidden units, so it has gate weights w1, w2, w3, one for each of them. Then a sequence x1, x2, ..., xN comes through. I am illustrating the cell as it transitions over time:

@ t=1
xN....x3    x2    x1

                 [w1, c1, h0]

                     (c2, h1)

@ t=2
xN....x4    x3    x2                   x1

                 [w2, c2, h1]

                     (c3, h2)

@ t=3
xN....x5    x4    x3                   x2    x1

                 [w3, c3, h2]

                     (c4, h3)

But what happens at t=4? The LSTM only has memory, and therefore gate weights, for 3 steps:

@ t=4
xN....x6    x5    x4                   x3    x2    x1

                 [w?, c4, h3]

                     (c5, h4)

What weights are used for x4 and all the following inputs? In essence, how are sequences that are longer than an LSTM cell's memory treated? Do the gate weights reset back to w1, or do they remain static at their latest value wT?


Edit: My question is not a duplicate of the LSTM inference question; that question asks about multi-step prediction from inputs, whereas I am asking about which weights are used over time for sequences that are longer than the number of internal hidden cell states. The question of weights is not addressed in that answer.

Best Answer

The gates are a function of the weights, the current input, and the previous hidden state. The weights are fixed.

Consider the equation for the forget gate $f_t$: $$f_t = \sigma(W_f \cdot [h_{t-1}, x_t]+b_f)$$ The forget gate uses the new data $x_t$ and the hidden state $h_{t-1}$, but $W_f$ and $b_f$ are fixed. This is why the LSTM only needs to keep the previous $h$ and the previous $c$.
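To make this concrete, here is a minimal NumPy sketch of a single LSTM cell unrolled over a long sequence. The sizes, random initialization, and function names are purely illustrative (not any particular library's API); the point is that the same four weight matrices are applied at every time step, however long the input is:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3           # illustrative sizes
rng = np.random.default_rng(0)

# One fixed set of parameters per gate, learned once and reused at every step.
# Each W acts on the concatenation [h_{t-1}, x_t].
W_f = rng.standard_normal((hidden_size, hidden_size + input_size)); b_f = np.zeros(hidden_size)
W_i = rng.standard_normal((hidden_size, hidden_size + input_size)); b_i = np.zeros(hidden_size)
W_c = rng.standard_normal((hidden_size, hidden_size + input_size)); b_c = np.zeros(hidden_size)
W_o = rng.standard_normal((hidden_size, hidden_size + input_size)); b_o = np.zeros(hidden_size)

def lstm_step(h_prev, c_prev, x_t):
    """One time step: every gate uses the same fixed weights plus h_{t-1} and x_t."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate, as in the equation above
    i_t = sigmoid(W_i @ z + b_i)          # input gate
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # new cell state
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)              # new hidden state
    return h_t, c_t

# A 1000-step sequence: the loop never needs a w4, w5, ...; it carries only h and c
# forward and applies the same four weight matrices at every step.
sequence = rng.standard_normal((1000, input_size))
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in sequence:
    h, c = lstm_step(h, c, x_t)
```

The loop answers the original question directly: there is no w1, w2, w3 to run out of, so nothing resets or goes stale at t=4; the sequence can be any length.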

More information: http://colah.github.io/posts/2015-08-Understanding-LSTMs/