Time Series – Use of the Hidden State in an LSTM Network

lstm · neural networks · time series

I am training an LSTM network for time series prediction. My understanding so far is that an LSTM network is suitable for time series prediction because it keeps a 'hidden state', which gives the network a 'notion' of what has happened in the past.

So you 'feed' the network information about, say, the last 10 days (days 1-10) in order to predict the value of the 11th day. Now we want to predict the 12th day, so we input the sequence of the last 10 days (days 2-11). However, the network still remembers what happened on the 1st day because of the hidden state, correct?
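For concreteness, here is a minimal sketch of that sliding-window setup (the toy data and variable names are just for illustration):

```python
import numpy as np

series = np.arange(1, 21, dtype=np.float32)  # toy daily values, days 1..20
window = 10

# Each input sample is a 10-day window; the target is the following day.
X = np.stack([series[i : i + window] for i in range(len(series) - window)])
y = series[window:]

print(X[0], y[0])  # days 1-10 -> day 11
print(X[1], y[1])  # days 2-11 -> day 12
```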

If the hidden state is reset between each forward pass, as advised here, as is standard in the Keras library as explained here, and as explained in this PyTorch tutorial, what is the use of the hidden state? In that case it is not 'remembered in time': as far as I understand, all sequences in the same batch are processed in parallel, so sample i+1 is not aware of the hidden state produced by sample i. What use does the hidden state have then, if it does not transfer information through time / between samples? Could we not just increase the sequence length to give the network knowledge about what has happened previously?

QUESTION: Assuming my understanding of the hidden state of an LSTM is correct, what is the use of the hidden state if it is reset between batches?

Best Answer

Your understanding is mostly correct. Yes, the purpose of the hidden state is to encode a history. If your input is the sequence of data from days 2 to 11, then the history encoded in the hidden state comes from days 2 to 11 only. So each batch should contain all the history needed for each output prediction. You can also use another RNN to decode the hidden state into a sequence of predictions, if that suits your needs better.
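As a rough PyTorch sketch of what this means in practice (the layer sizes and names are placeholders, not taken from your setup): if you pass a batch of 10-day windows through nn.LSTM without supplying an initial state, PyTorch initialises the state to zeros for every sequence, so after the forward pass the hidden state summarises only those 10 days.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)

x = torch.randn(64, 10, 1)      # batch of 64 windows, 10 days each, 1 feature

# No (h0, c0) passed -> PyTorch initialises them to zeros for every sequence,
# so the hidden state only accumulates the 10 days inside each window.
out, (h_n, c_n) = lstm(x)       # out: (64, 10, 32), h_n: (1, 64, 32)

pred_next_day = head(h_n[-1])   # one prediction per window, e.g. day 11
```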

You definitely need to reset the hidden state between batches. When you train a neural network with something like SGD, you assume that your data are i.i.d., so by that assumption subsequent batches are independent of one another. In that setting you certainly don't want the hidden state from one prediction to influence the next.
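A minimal sketch of what that reset looks like in a training loop, again with placeholder names and toy data: the state is simply recreated from zeros on every forward pass, so nothing leaks from one batch into the next.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)
optimizer = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()))
loss_fn = nn.MSELoss()

# Toy batches of (windows, targets); in practice this would come from a DataLoader.
batches = [(torch.randn(64, 10, 1), torch.randn(64, 1)) for _ in range(5)]

for xb, yb in batches:
    optimizer.zero_grad()
    out, (h_n, _) = lstm(xb)            # no h0 passed -> fresh zero state, i.e. a "reset"
    loss = loss_fn(head(h_n[-1]), yb)
    loss.backward()
    optimizer.step()
    # h_n is not fed into the next iteration, so batches remain independent.
```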