Time Series and Machine Learning – Best Practices for Ordering Time Series Data in Machine Learning

cross-validation, machine learning, time series

After reading one of the "Research tips" of R.J. Hyndman about cross-validation and time series, I came back to an old question of mine that I'll try to formulate here. The idea is that in classification or regression problems, the ordering of the data is not important, and hence k-fold cross-validation can be used. On the other hand, in time series, the ordering of the data is obviously of great importance.

However, when using a machine learning model to forecast time series, a common strategy is to reshape the series $\{y_1, …, y_T\}$ into a set of "input-output vectors" which, for a time $t$, have the form $(y_{t-n+1}, …, y_{t-1}, y_{t}; y_{t+1})$.
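This reshaping can be sketched in Python with NumPy. The function name `embed` and the toy series are illustrative and not from the original post. Each row of `X` holds $n$ consecutive values and `target` holds the value that follows them.

```python
import numpy as np

def embed(y, n):
    """Reshape a 1-D series into input-output pairs with n lagged inputs."""
    y = np.asarray(y, dtype=float)
    if n < 1 or len(y) <= n:
        raise ValueError("need 1 <= n < len(y)")
    # Row i holds y[i], ..., y[i+n-1]; the matching output is y[i+n].
    X = np.array([y[i:i + n] for i in range(len(y) - n)])
    target = y[n:]
    return X, target

# Toy series 1..6 with n = 3 inputs per vector.
X, target = embed([1, 2, 3, 4, 5, 6], n=3)
# X rows:  [1, 2, 3], [2, 3, 4], [3, 4, 5]
# target:  4, 5, 6
```

Note that consecutive rows overlap in $n-1$ values, which is why the question of whether the rows can be treated as unordered is not trivial.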

Now, once this reshaping has been done, can we consider that the resulting set of "input-output vectors" does not need to be ordered? If we use, for example, a feed-forward neural network with $n$ inputs to "learn" these data, we would arrive at the same results no matter the order in which we show the vectors to the model. And therefore, could we use k-fold cross-validation the standard way, without the need to re-fit the model each time?
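To make the procedure concrete, here is a minimal sketch of standard shuffled k-fold cross-validation applied to the reshaped vectors. Several things are assumptions made for illustration: the series is simulated from an AR(2) process, a linear least-squares autoregression stands in for the feed-forward network, and the helper names (`simulate_ar2`, `embed`, `kfold_mse`) are hypothetical.

```python
import numpy as np

def simulate_ar2(T, phi1=0.6, phi2=-0.5, seed=0):
    """Simulate a stationary AR(2) series with unit-variance Gaussian noise."""
    rng = np.random.default_rng(seed)
    burn = 100
    e = rng.standard_normal(T + burn)
    y = np.zeros(T + burn)
    for t in range(2, T + burn):
        y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + e[t]
    return y[burn:]  # drop the burn-in period

def embed(y, n):
    """Rows of X are n consecutive values; target is the next value."""
    y = np.asarray(y, dtype=float)
    X = np.array([y[i:i + n] for i in range(len(y) - n)])
    return X, y[n:]

def kfold_mse(X, target, k=5, seed=0):
    """Shuffled k-fold CV of a linear model with intercept; returns mean MSE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(target))      # rows are treated as unordered
    scores = []
    for test_idx in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, test_idx)
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, target[train_idx], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        scores.append(np.mean((target[test_idx] - A_test @ coef) ** 2))
    return float(np.mean(scores))

y = simulate_ar2(2000)
X, target = embed(y, n=2)   # n matches the true order of the simulated process
cv_mse = kfold_mse(X, target, k=5)
print(f"5-fold CV mean squared error: {cv_mse:.3f}")
```

In this setup the noise variance is 1, so a cross-validated error close to 1 indicates that the estimate is tracking the true one-step-ahead error.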

Best Answer

The answer to this question is that this will work fine as long as your model order is correctly specified, as then the errors from your model will be independent.
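One way to probe this condition is to look at the autocorrelation of the fitted model's residuals. The sketch below is illustrative only: it assumes a simulated AR(2) series and a linear least-squares autoregression, and the helper names are hypothetical. It compares residuals from a model with the correct number of lags ($n = 2$) against an underspecified one ($n = 1$).

```python
import numpy as np

def simulate_ar2(T, phi1=0.6, phi2=-0.5, seed=0):
    """Simulate a stationary AR(2) series with unit-variance Gaussian noise."""
    rng = np.random.default_rng(seed)
    burn = 100
    e = rng.standard_normal(T + burn)
    y = np.zeros(T + burn)
    for t in range(2, T + burn):
        y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + e[t]
    return y[burn:]

def embed(y, n):
    """Rows of X are n consecutive values; target is the next value."""
    X = np.array([y[i:i + n] for i in range(len(y) - n)])
    return X, y[n:]

def fit_residuals(y, n):
    """Fit a linear autoregression with n lags and return its residuals."""
    X, target = embed(y, n)
    A = np.column_stack([np.ones(len(target)), X])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return target - A @ coef

def autocorr(r, max_lag):
    """Sample autocorrelation of r at lags 1..max_lag."""
    r = r - r.mean()
    denom = np.dot(r, r)
    return np.array([np.dot(r[:-k], r[k:]) / denom
                     for k in range(1, max_lag + 1)])

y = simulate_ar2(2000)
acf_correct = autocorr(fit_residuals(y, n=2), max_lag=5)  # correct order
acf_under = autocorr(fit_residuals(y, n=1), max_lag=5)    # too few lags

print("max |autocorrelation|, n=2:", np.abs(acf_correct).max())
print("max |autocorrelation|, n=1:", np.abs(acf_under).max())
```

With the correct order the residual autocorrelations should be close to zero, while the underspecified model leaves visible serial correlation in its errors. Uncorrelated residuals are only a necessary sign of independence, so this check can flag a problem but cannot prove its absence.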

This paper shows that if a model is poor, cross-validation will underestimate how poor it actually is. In all other cases cross-validation will do a good job, and in particular a better job than the out-of-sample evaluation usually used in the time series context.