Solved – Understanding how to batch and feed data into a stateful LSTM

keras, lstm, recurrent neural network, tensorflow, time series

Let me use daily price prediction of Bitcoin as a simple example (I am not actually working with Bitcoin, but its temporal nature lends itself well to explaining my question).

Say I had a data set consisting of the last 101 sequential days of Bitcoin's closing prices [p1, p2, p3, ... , p101], where pX is the closing price on day X. The inputs will be the first 100 days, and the labels will be [p2, ... , p101] – the inputs shifted by 1 (attempting to predict the next day's closing price).

As I understand how a stateful LSTM works, I could divide my 100 training examples into 4 sequences of 25 examples each. Each of these 4 sequences would be a single batch, so the input shape to my LSTM, (batchSize, timeSteps, features), would be (1, 25, 1).

Each epoch would consist of 4 batches. I would first feed in batch1 = [p1, ... , p25] (with the labels for each time step, [p2, ... , p26]), then pass the final state on as the initial state for batch2 = [p26, ... , p50], and so on. Once all 4 batches are processed and the epoch is complete, I would reset the state and repeat for as many epochs as necessary.
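For concreteness, here is a rough sketch of the setup I have in mind (the layer size, optimizer, and the placeholder `prices` array are illustrative, not my actual code):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, TimeDistributed

prices = np.random.rand(101).astype("float32")  # placeholder for p1 .. p101

x = prices[:100].reshape(4, 25, 1)    # 4 sequences of 25 inputs
y = prices[1:101].reshape(4, 25, 1)   # the same sequences shifted by one day

model = Sequential([
    LSTM(32, stateful=True, return_sequences=True,
         batch_input_shape=(1, 25, 1)),          # (batchSize, timeSteps, features)
    TimeDistributed(Dense(1)),
])
model.compile(optimizer="adam", loss="mse")

for epoch in range(10):
    for i in range(4):                                # 4 sequential batches per epoch
        model.train_on_batch(x[i:i + 1], y[i:i + 1])  # state carries over between batches
    model.reset_states()                              # reset only once the epoch is complete
```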

If the LSTM could accurately predict the following day's price using the previous 25 days as an input sequence, I would then like to use it to make daily, real-time predictions of prices, not once every 25 days.

It is currently day 101, and I would like to make a prediction for day 102, p102. I would feed in [p77, ... , p101] as the input. Then tomorrow, on day 102, I would like to predict p103, so I would feed in [p78, ... , p102]. These batches no longer follow on sequentially from one another; instead each is shifted one day forward. How should I deal with the state of the LSTM when doing so? On each of these days, would I feed in the previous 100 days as 4 batches of 25 so that the state is built up before I make my prediction for tomorrow?

In reality, I am working on a much more complex problem with a far more extensive data set. I thought I understood how a stateful LSTM works until I trained it in sequential batches exactly as explained above. I then tried this process of shifting each input forward by one day per batch, on the exact same training set, and the model's accuracy was far lower than it had been during training.

I thought that if I trained a stateful LSTM on 100 examples in 4 batches of 25, I could then take any arbitrary sequence of 25 examples from this same 100 and it would predict the following day with the same accuracy as training.

Edit

To make things clearer, here is how my data would be batched to train over 2 epochs and then make 3 daily predictions after training (a code sketch of the prediction step follows the layout):

TRAINING:

Epoch 1 inputs:
[p1, ... , p25]
[p26, ... , p50]
[p51, ... , p75]
[p76, ... , p100]

Reset State

Epoch 2 inputs:
[p1, ... , p25]
[p26, ... , p50]
[p51, ... , p75]
[p76, ... , p100]

PREDICTION:

(Reset State?)
(Build up state by processing [p2, ... , p76] in 3 batches of 25?)

Inputs to predict price on p102:
[p77, ... , p101]

(Reset State?)
(Build up state by processing [p3, ... , p77] in 3 batches of 25?)

Inputs to predict price on p103:
[p78, ... , p102]

(Reset State?)
(Build up state by processing [p4, ... , p78] in 3 batches of 25?)

Inputs to predict price on p104:
[p79, ... , p103]
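In code, if rebuilding the state each day is indeed the right approach, I imagine the prediction for p102 would look something like the sketch below (it reuses the placeholder `model` and `prices` from above; taking the last time step of the output as the next-day prediction is my assumption):

```python
# Hypothetical daily prediction, assuming state must be rebuilt from the last 100 days
model.reset_states()

history = prices[-100:].reshape(4, 25, 1)    # [p2 .. p101] as 4 windows of 25
for i in range(3):                           # build up state on [p2 .. p76]
    model.predict(history[i:i + 1], batch_size=1)

out = model.predict(history[3:4], batch_size=1)   # final window [p77 .. p101]
p102_hat = out[0, -1, 0]                          # last time step = prediction for p102
```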

Best Answer

You're conflating two different things with regard to LSTM models.

The batch size refers to how many input-output pairs are used in a single back-propagation pass. This is not to be confused with the window size used as your time series predictors - these are independent hyper-parameters.

The normal way to solve this would be to pick a window size (let's say 25 since that was what you proposed). Now say that we use an LSTM network to predict the 26th point using the previous 25 as predictors. You would then repeat that process for each of the remaining points (27-100) using the preceding 25 points as your inputs in each case. That will yield you exactly 75 training points.
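In code, building those windows might look roughly like this (a sketch; the `prices` array here is just a placeholder for the 100 closing prices):

```python
import numpy as np

window = 25
prices = np.random.rand(100).astype("float32")   # placeholder for the 100 closing prices

# Each input is a window of 25 consecutive prices; the target is the price that follows it.
x = np.stack([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]
x = x.reshape(-1, window, 1)   # shape (75, 25, 1): the 75 training points
```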

Batch size will dictate how many of these points are grouped together for backprop purposes. If you picked 5, for instance, you'd get 15 training batches (75 training points divided into batches of 5).
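With an ordinary (stateless) Keras model, that is just the `batch_size` argument to `fit`; a sketch using the windows built above (layer sizes are arbitrary):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(16, input_shape=(window, 1)),   # many-to-one: a window of 25 in, one value out
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# batch_size=5 groups the 75 windows into 15 batches, one gradient update per batch
model.fit(x, y, epochs=10, batch_size=5)
```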

Note that this is a very small amount of data, so unless you use a very small NNet or heavy regularization, you're going to be at great risk of overfitting. You'd normally want to do a train-test split to be able to perform out-of-sample validation on the model, but given how few data points you have to work with that's going to be a bit tough.
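If you do attempt it, the split should be chronological (no shuffling, since this is a time series); a sketch with an arbitrary cut-off:

```python
split = 60                                # first 60 windows for training, last 15 for validation
x_train, y_train = x[:split], y[:split]
x_val, y_val = x[split:], y[split:]

model.fit(x_train, y_train, epochs=10, batch_size=5,
          validation_data=(x_val, y_val), shuffle=False)
```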
