What are the advantages, why would one use multiple LSTMs, stacked one side-by-side, in a deep-network? I am using a LSTM to represent a sequence of inputs as a single input. So once I have that single representation— why would I pass it through again?
I am asking this because I saw this in a natural-language generation program.
Best Answer
I think that you are referring to vertically stacked LSTM layers (assuming the horizontal axes is the time axis.
In that case the main reason for stacking LSTM is to allow for greater model complexity. In case of a simple feedforward net we stack layers to create a hierarchical feature representation of the input data to then use for some machine learning task. The same applies for stacked LSTM's.
At every time step an LSTM, besides the recurrent input. If the input is already the result from an LSTM layer (or a feedforward layer) then the current LSTM can create a more complex feature representation of the current input.
Now the difference between having a feedforward layer between the feature input and the LSTM layer and having another LSTM layer is that a feed forward layer (say a fully connected layer) does not receive feedback from its previous time step and thus can not account for certain patterns. Having an LSTM in stead (e.g. using a stacked LSTM representation) more complex input patterns can be described at every layer