I was working through the various Machine Learning Mastery tutorials but got very confused. Some answers (for instance this one, and many others) helped me, but I am still confused.
Difference between batch_size, timesteps, lags and what are the correct input dimensions?
I will provide you with an example.
I have a time series
timeSeries = np.array([[4,6,1,4,1,6,8,4,3,1,9,8,6,7,7,5]])
I want to do some predictions with it, using LSTM in Keras.
Predicting value at t
What are the batch_sizes, timesteps, epochs etc if I want to use past values to predict the one at t
?
Suppose I want to use t-2 and t-1 to predict t. Then I can create these training datasets:
xtrain = np.array([[4, 6],
                   [6, 1],
                   [1, 4],
                   [4, 1],
                   [1, 6],
                   [6, 8],
                   [8, 4],
                   [4, 3],
                   [3, 1],
                   [1, 9],
                   [9, 8],
                   [8, 6],
                   [6, 7],
                   [7, 7]])
ytrain = np.array([[1, 4, 1, 6, 8, 4, 3, 1, 9, 8, 6, 7, 7, 5]])
Each column/feature in xtrain is lagged by one step relative to ytrain. This means that the first column of xtrain contains the values at t-2, while the second column of xtrain contains the values at t-1.
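The lagged construction above can be sketched with a sliding window over the series (a minimal numpy sketch; the `window` name is mine, not from the original post):

```python
import numpy as np

# Build lagged training pairs from the series with a sliding window:
# each row of xtrain holds the `window` past values, ytrain holds the next one.
series = np.array([4, 6, 1, 4, 1, 6, 8, 4, 3, 1, 9, 8, 6, 7, 7, 5])
window = 2

xtrain = np.array([series[i:i + window] for i in range(len(series) - window)])
ytrain = series[window:]

print(xtrain.shape)  # (14, 2)
print(ytrain.shape)  # (14,)
print(xtrain[0], ytrain[0])  # [4 6] 1  -- t-2 and t-1 predicting t
```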
This is how I would set up the model:
model = Sequential()
model.add(LSTM(number_units, input_shape=(samples, timesteps, features)))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
From my understanding, samples would be equal to len(xtrain) = 14, and features to xtrain.shape[1] = 2. But what would timesteps be? The lag between ytrain and the second column of xtrain is 1, and the lag between the second column of xtrain and the first column of xtrain is 1 again. So I am tempted to say that timesteps is 1? But surely it means something else. So what does it mean?
Also, if I put 1, I would have
model = Sequential()
model.add(LSTM(number_units, input_shape=(14, 1, 2)))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
and to fit the model, I would have
model.fit(xtrain.reshape(xtrain.shape[0], 1, xtrain.shape[1]), ytrain, epochs=e, batch_size=bs)
What would the batch size and the epochs be in this case? Normally an epoch is when the NN has gone through the whole xtrain, while a batch_size is the number of training examples after which the model updates the weights. But does that even make sense in an LSTM? If I set batch_size equal to 3, for instance, what would the model actually do?
My understanding is:
it will take
[[4,6], [6,1], [1,4]]
feed this into the LSTM, and update the weights. Then it would take
[[4,1], [1,6], [6,8]]
and update the weights, and so on. After it arrives at [[6,7], [7,7]], it will count this as one epoch. Is this correct?
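The batching I am describing can be sketched in plain numpy (row indices stand in for the 14 training rows; the names here are mine, just for illustration):

```python
import numpy as np

# One epoch = one full pass over the data, taken in chunks of batch_size;
# the weights would be updated once per chunk.
rows = np.arange(14)  # stand-in for the 14 rows of xtrain
bs = 3

batches = [rows[i:i + bs] for i in range(0, len(rows), bs)]
print(len(batches))      # 5 updates per epoch: 3 + 3 + 3 + 3 + 2
print(len(batches[-1]))  # 2 -- the last batch is partial
```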
And what would change if I had put timesteps = 2?
What would happen if I wanted to predict t, t+1, etc.? Would this influence the timesteps?
Best Answer
I also had this question before. On a higher level, in (samples, time steps, features):

- samples is the number of data points, i.e. how many rows there are in your data set;
- time steps is the number of times inputs are fed to the model/LSTM;
- features is the number of columns of each sample.

For me, a better example for understanding it comes from NLP. Suppose you have a sentence to process: then samples is 1, meaning 1 sentence to read; time steps is the number of words in that sentence, since you feed in the sentence word by word until the model has read all the words and gets the whole context of the sentence; features is the dimension of each word, because in word embeddings like word2vec or GloVe each word is represented by a vector with multiple dimensions.

The input_shape parameter in Keras is only (time_steps, num_features); for more you can refer to this. That's basically how I understand it; I hope this makes it clear for you.
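Concretely, for the xtrain in the question there are two ways to lay out the same lagged rows in the (samples, time_steps, features) format, and they mean different things to the LSTM. A minimal numpy sketch (only the first three rows, for brevity):

```python
import numpy as np

xtrain = np.array([[4, 6], [6, 1], [1, 4]])  # first three lagged rows

# Option A: one time step, the two lags treated as two features;
# input_shape for the LSTM layer would then be (1, 2) -- no samples dimension.
a = xtrain.reshape(xtrain.shape[0], 1, 2)

# Option B: two time steps of one feature each, so the LSTM reads the
# lags as a sequence; input_shape would then be (2, 1).
b = xtrain.reshape(xtrain.shape[0], 2, 1)

print(a.shape)  # (3, 1, 2)
print(b.shape)  # (3, 2, 1)
```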