Solved – Meaning of batch sizes for RNNs/LSTMs and reasons for padding

backpropagation, keras, lstm, neural networks, recurrent neural network

I've got two conceptual questions about RNNs, particularly LSTMs, which I just can't figure out on my own or from the tutorials I find on the internet. I would really appreciate it if you could help me with the following:

  1. If I understand correctly, the states learned within an LSTM are only relevant to one sequence. So, for the next sequence the states are "relearned" via $s_{t}=f(Ux_{t} + Ws_{t-1})$, with $x_{t}$ being the input at timestep $t$, $s_{t}$ the state at timestep $t$, and $U$ and $W$ the matrices that are learned (see the sketch after these questions). Is there any good reason why you should use batch sizes larger than 1 with RNNs/LSTMs in particular? I know the differences between stochastic gradient descent, batch gradient descent and mini-batch gradient descent, but not why the latter two should be preferred over the first one for RNNs/LSTMs.
  2. Why do you need the same sequence length within a batch, i.e. why is padding needed? The states are calculated for each sequence separately, so I don't see a reason for it. Does backpropagation through time need the same number of states for each sequence when it is executed after a batch?
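For reference, this is roughly how I picture the recurrence for a single sequence (a minimal NumPy sketch with made-up dimensions, not code from any particular library):

```python
import numpy as np

def run_sequence(X, U, W, f=np.tanh):
    """Apply s_t = f(U x_t + W s_{t-1}) over one sequence.

    X has shape (timesteps, input_dim); the state starts at zero and is
    only carried forward within this single sequence.
    """
    s = np.zeros(W.shape[0])        # fresh state for every new sequence
    for x_t in X:
        s = f(U @ x_t + W @ s)      # state update at timestep t
    return s                        # final state after the last timestep

# Toy sizes: input_dim = 3, state_dim = 4, one sequence of 5 timesteps.
rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3))
W = rng.normal(size=(4, 4))
X = rng.normal(size=(5, 3))
print(run_sequence(X, U, W))
```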

Best Answer

  1. The state isn't really what is being learned. The weights that determine the state are where the learning happens. The state just holds some abstract representation of what the network has seen so far in the sequence, so yes, the state is only relevant to the current sequence.
    The advantages of larger batch sizes are better parallelization and smoothing of the gradient so the updates aren't as noisy; the batch size has no effect on how state is handled across different training sequences, since each sequence in the batch still gets its own state (see the first sketch below this list).
  2. You are correct that padding is not strictly necessary and you can operate on sequences of different lengths. But to process a whole batch in parallel the inputs have to be packed into a rectangular tensor, and the code is much easier to write when all sequences have the same length, so that's usually what you'll see (a masking example follows this list).
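To make the first point concrete, here is a minimal NumPy sketch (hypothetical sizes) of one timestep applied to a whole batch: each row of the state matrix belongs to one sequence and the rows never interact, so a larger batch only buys parallel work and an averaged, less noisy gradient.

```python
import numpy as np

batch_size, input_dim, state_dim = 32, 3, 4
rng = np.random.default_rng(0)
U = rng.normal(size=(state_dim, input_dim))
W = rng.normal(size=(state_dim, state_dim))

X_t = rng.normal(size=(batch_size, input_dim))  # inputs at timestep t, one row per sequence
S_prev = np.zeros((batch_size, state_dim))      # previous states, one row per sequence

# One batched step of s_t = f(U x_t + W s_{t-1}) for all sequences at once.
S_t = np.tanh(X_t @ U.T + S_prev @ W.T)         # shape (batch_size, state_dim)

# Row i of S_t depends only on row i of X_t and S_prev: the sequences in
# the batch stay independent, exactly as if they were processed one by one.
S_loop = np.array([np.tanh(U @ x + W @ s) for x, s in zip(X_t, S_prev)])
assert np.allclose(S_t, S_loop)
```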
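And for the second point: most frameworks also let you pad to a common length and then mask the padded steps so they affect neither the states nor the loss. A minimal Keras sketch, assuming TensorFlow 2.x (the toy data and layer sizes are made up):

```python
import numpy as np
import tensorflow as tf

# Three toy sequences of different lengths, feature dimension 1.
seqs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Pad with zeros at the end so the batch becomes a rectangular tensor.
padded = tf.keras.preprocessing.sequence.pad_sequences(
    seqs, padding="post", dtype="float32")      # shape (3, 4)
padded = padded[..., np.newaxis]                # shape (3, 4, 1)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 1)),
    # Masking skips timesteps whose features all equal mask_value, so the
    # padded zeros never update the LSTM state (real values here are 1-9).
    tf.keras.layers.Masking(mask_value=0.0),
    tf.keras.layers.LSTM(8),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

y = np.array([0.0, 1.0, 0.0], dtype="float32")  # made-up targets
model.fit(padded, y, batch_size=3, epochs=1, verbose=0)
```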