Say I have one year of daily closing prices for a stock, split 80:20 into train and test data.
I use TimeseriesGenerator to fit the model on the train data.
To test the fitted model, I would use the last 20 records of the train dataset to predict the next record and compare it with the first record in the test dataset.
My question is: to predict the 2nd record, should the new value added to the input window be the predicted value or the first actual value from the test dataset?
Concretely, in the code below:
test_predictions = []
first_eval_batch = scaled_train[-n_input:]
current_batch = first_eval_batch.reshape((1, n_input, n_features))
for i in range(len(test)):
    # get the prediction value for the first batch
    current_pred = model.predict(current_batch)[0]
    # append the prediction into the array
    test_predictions.append(current_pred)
    # use the prediction to update the batch and remove the first value
    current_batch = np.append(current_batch[:, 1:, :], [[scaled_test[i]]], axis=1)
In the last line, should I use scaled_test[i] or current_pred?
Best Answer
You've trained the model on the whole train set, and now you're moving onto test.
If I understand correctly, you want to know whether, when predicting the 2nd point in the test set, the newest value in the model's input window should be the predicted value for the 1st test point or the real test set value.
This depends on whether, in a real deployment, you would actually know that 1st value by the time you need to predict the 2nd. If by day 2 you would know the real closing price from day 1, then use the actual price (scaled_test[i]). There is no data leakage in that, because the price would genuinely have been available to you at prediction time. If you would not know it yet, for example when forecasting several days ahead in one go, feed current_pred back into the window instead.
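To make the two options concrete, here is a minimal sketch, assuming the same array shapes as the loop in the question. `LastValueModel` and `rolling_forecast` are hypothetical names invented for this illustration; the stand-in model simply repeats the last value it sees, so the example runs without a trained network. It is meant only to show how the choice of `scaled_test[i]` versus `current_pred` changes what the window contains.

```python
import numpy as np


class LastValueModel:
    """Hypothetical stand-in for the fitted model: predicts the last value in the window."""

    def predict(self, batch):
        # batch has shape (1, n_input, n_features); result has shape (1, n_features)
        return batch[:, -1, :]


def rolling_forecast(model, scaled_train, scaled_test, n_input, use_actuals):
    """Predict len(scaled_test) values one step at a time (illustrative helper)."""
    n_features = scaled_train.shape[1]
    current_batch = scaled_train[-n_input:].reshape((1, n_input, n_features))
    predictions = []
    for i in range(len(scaled_test)):
        current_pred = model.predict(current_batch)[0]
        predictions.append(current_pred)
        # use_actuals=True: append the real value (known by the next step).
        # use_actuals=False: append the model's own prediction.
        next_value = scaled_test[i] if use_actuals else current_pred
        # Drop the oldest value and append the new one to slide the window.
        current_batch = np.append(current_batch[:, 1:, :], [[next_value]], axis=1)
    return np.array(predictions)


# Toy data (made up for the example): train is 0..9, test is 10..12.
scaled_train = np.arange(10, dtype=float).reshape(-1, 1)
scaled_test = np.arange(10, 13, dtype=float).reshape(-1, 1)

walk_forward = rolling_forecast(LastValueModel(), scaled_train, scaled_test, 3, True)
recursive = rolling_forecast(LastValueModel(), scaled_train, scaled_test, 3, False)
print(walk_forward.ravel())  # window is refreshed with real prices each step
print(recursive.ravel())     # window is fed only the model's own outputs
```

With real values the window keeps tracking the true series, while in the recursive case any error in one prediction carries into all later ones, so scores from the two modes are not directly comparable.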