Neural Networks – How to Process Sequences Longer Than Memory in LSTM

lstm, neural networks, recurrent neural network

* Note: The premise of my question was incorrect in the first place. My question assumes that an LSTM maintains a separate set of weights for each time step of the memory length it is given as a design parameter. As Sycorax pointed out, only a single set of weights is learned, so an LSTM cell does not care how long a sequence is: it applies the same set of weights to each step in the sequence.

Please don't let my question confuse your understanding of LSTMs.


Terminology:

  • Cell: the LSTM unit containing the input, forget, and output gates, the hidden state hT, and the cell state cT.
  • Hidden units/memory: how far back in time the LSTM is "unrolled". A hidden unit is an instance of the cell at a particular time step.
  • A hidden unit is parameterized by [wT, cT, hT-1]: the gate weights for the current hidden unit, the current cell state, and the previous hidden unit's output, where wT represents the input, output, and forget gate weights.

An LSTM maintains separate gate weights wT for each hidden unit. This way it can treat different points in time of a sequence differently. (see *Note above)

Let's say an LSTM has 3 hidden units, so it has gate weights w1, w2, w3, one for each of them. Then a sequence x1, x2, ..., xN comes through. I am illustrating the cell as it transitions over time:

@ t=1
xN....x3    x2    x1

                 [w1, c1, h0]

                     (c2, h1)

@ t=2
xN....x4    x3    x2                   x1

                 [w2, c2, h1]

                     (c3, h2)

@ t=3
xN....x5    x4    x3                   x2    x1

                 [w3, c3, h2]

                     (c4, h3)

But what happens at t=4? The LSTM only has memory, and therefore gate weights, for 3 steps:

@ t=4
xN....x6    x5    x4                   x3    x2    x1

                 [w?, c4, h3]

                     (c5, h4)

What weights are used for x4 and all the following inputs? In essence, how are sequences that are longer than an LSTM cell's memory treated? Do the gate weights reset back to w1, or do they remain static at their latest value wT?


Edit: My question is not a duplicate of the LSTM inference question; that question asks about multi-step prediction from inputs, whereas I am asking about which weights are used over time for sequences that are longer than the number of internal hidden cell states. The question of weights is not addressed in that answer.

Best Answer

The gates are a function of the weights, the current input, and the previous hidden state. The weights are fixed.

Consider the equation for the forget gate $f_t$: $$f_t = \sigma(W_f \cdot [h_{t-1}, x_t]+b_f)$$ The forget gate uses the new data $x_t$ and the hidden state $h_{t-1}$, but $W_f$ and $b_f$ are fixed. This is why the LSTM only needs to keep the previous $h$ and the previous $c$.
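To make this concrete, here is a minimal NumPy sketch of a single LSTM cell unrolled over a long sequence. The sizes, random initialization, and function names are purely illustrative (not any particular library's API); the point is that the same four weight matrices are applied at every time step, however long the input is:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3           # illustrative sizes
rng = np.random.default_rng(0)

# One fixed set of parameters per gate, learned once and reused at every step.
# Each W acts on the concatenation [h_{t-1}, x_t].
W_f = rng.standard_normal((hidden_size, hidden_size + input_size)); b_f = np.zeros(hidden_size)
W_i = rng.standard_normal((hidden_size, hidden_size + input_size)); b_i = np.zeros(hidden_size)
W_c = rng.standard_normal((hidden_size, hidden_size + input_size)); b_c = np.zeros(hidden_size)
W_o = rng.standard_normal((hidden_size, hidden_size + input_size)); b_o = np.zeros(hidden_size)

def lstm_step(h_prev, c_prev, x_t):
    """One time step: every gate uses the same fixed weights plus h_{t-1} and x_t."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate, as in the equation above
    i_t = sigmoid(W_i @ z + b_i)          # input gate
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # new cell state
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)              # new hidden state
    return h_t, c_t

# A 1000-step sequence: the loop never needs a w4, w5, ...; it carries only h and c
# forward and applies the same four weight matrices at every step.
sequence = rng.standard_normal((1000, input_size))
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in sequence:
    h, c = lstm_step(h, c, x_t)
```

The loop answers the original question directly: there is no w1, w2, w3 to run out of, so nothing resets or goes stale at t=4; the sequence can be any length.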

More information: http://colah.github.io/posts/2015-08-Understanding-LSTMs/