Time Series – Using Test Data in Model to Make Predictions

cross-validation, lstm, time series

Say I have one year of daily closing prices for a stock and I split them 80:20 into train and test data.

I then use TimeSeriesGenerator to fit the model on the train data.

After fitting the model, I want to test it. To do that, I would use, say, the last 20 records of the train dataset to predict the next record, and compare that prediction with the first record in the test dataset.

My question is: in order to predict the 2nd test record, should the value I append to the input window be the predicted value, or the first actual value from the test dataset?

Here is the code I mean:

import numpy as np

test_predictions = []

# start from the last n_input points of the training data
first_eval_batch = scaled_train[-n_input:]
current_batch = first_eval_batch.reshape((1, n_input, n_features))

for i in range(len(test)):

    # predict the next value from the current batch
    current_pred = model.predict(current_batch)[0]

    # store the prediction
    test_predictions.append(current_pred)

    # drop the oldest value from the batch and append the newest one
    current_batch = np.append(current_batch[:,1:,:],[[scaled_test[i]]],axis=1)

In the last line, should I use scaled_test[i] or current_pred?

Best Answer

You've trained the model on the whole train set, and now you're moving onto test.

If I understand correctly, you want to know this: when predicting the 2nd point in the test set, should the newest value in the model's input window be the predicted value for the 1st test point, or the real value from the test set?

This depends on whether, in a real deployment, you would actually know that 1st value by the time you need to predict the 2nd one. Basically, if by day 2 you would know the real closing price from day 1, then use the actual price (scaled_test[i])! There's no harm in that and no data leakage, because you would genuinely have known the real price from day 1. If instead you need to forecast several days ahead at once, without seeing the prices in between, then feed current_pred back in, since that is all the model would have at prediction time.
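To make the two options concrete, here is a minimal sketch that runs the same loop both ways. It is only an illustration: `LastValueModel` is a hypothetical stand-in for the fitted LSTM (it simply repeats the last value of its input window), and `rolling_forecast` and `use_actuals` are names made up for this example. The arrays are toy data, not real prices.

```python
import numpy as np


class LastValueModel:
    """Hypothetical stand-in for the fitted model: repeats the window's last value."""

    def predict(self, batch):
        # batch has shape (1, n_input, n_features); return shape (1, n_features)
        return batch[:, -1, :]


def rolling_forecast(model, scaled_train, scaled_test, n_input, use_actuals):
    """Predict every test point, updating the window with actuals or predictions."""
    n_features = scaled_train.shape[1]
    current_batch = scaled_train[-n_input:].reshape((1, n_input, n_features))
    predictions = []
    for i in range(len(scaled_test)):
        current_pred = model.predict(current_batch)[0]
        predictions.append(current_pred)
        # use_actuals=True: append the observed value (known by the next day).
        # use_actuals=False: append the model's own prediction (multi-step forecast).
        next_value = scaled_test[i] if use_actuals else current_pred
        current_batch = np.append(current_batch[:, 1:, :], [[next_value]], axis=1)
    return np.array(predictions)


# Toy data: a steadily rising series, already "scaled"
scaled_train = np.arange(10, dtype=float).reshape(-1, 1) / 10.0
scaled_test = np.array([[1.0], [1.1], [1.2]])

model = LastValueModel()
walk_forward = rolling_forecast(model, scaled_train, scaled_test, 3, use_actuals=True)
recursive = rolling_forecast(model, scaled_train, scaled_test, 3, use_actuals=False)
```

With the actual values fed back, each prediction is made from the most recent real data, so it measures one-step-ahead accuracy. With predictions fed back, errors can compound over the horizon, which is the honest test if you really have to forecast the whole test period in one go.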
