Question:
I want to be sure of something: is using k-fold cross-validation with time series straightforward, or does one need to pay special attention before using it?
Background:
I'm modeling a 6-year time series (with a semi-Markov chain), with one data sample every 5 min. To compare several models, I'm using 6-fold cross-validation, splitting the data into 6 one-year blocks, so my training sets (used to estimate the parameters) are 5 years long and the test sets are 1 year long. I'm not taking the time order into account, so my sets are:
- fold 1 : training [1 2 3 4 5], test [6]
- fold 2 : training [1 2 3 4 6], test [5]
- fold 3 : training [1 2 3 5 6], test [4]
- fold 4 : training [1 2 4 5 6], test [3]
- fold 5 : training [1 3 4 5 6], test [2]
- fold 6 : training [2 3 4 5 6], test [1].
I'm assuming that the years are independent of each other. How can I verify that?
Is there any reference showing the applicability of k-fold cross-validation to time series?
Best Answer
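Time series (or other intrinsically ordered data) can be problematic for cross-validation. If some pattern emerges in year 3 and persists through years 4-6, then your model can pick up on it during training, even though it wasn't part of years 1 & 2. A fold that trains on the later years and tests on year 1 or 2 is therefore fit using information from the future of its test period.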
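An approach that's sometimes more principled for time series is forward chaining, where your procedure would be something like this:
- fold 1 : training [1], test [2]
- fold 2 : training [1 2], test [3]
- fold 3 : training [1 2 3], test [4]
- fold 4 : training [1 2 3 4], test [5]
- fold 5 : training [1 2 3 4 5], test [6]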
That more accurately models the situation you'll see at prediction time, where you'll fit the model on past data and predict on forward-looking data. It will also give you a sense of how your modeling depends on the amount of training data, since each fold trains on one more year than the last.
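As a rough illustration, forward chaining over yearly blocks could be sketched in Python as below. The function name `forward_chaining_splits` and the integer year labels are assumptions made for this example, not part of any library; in practice each label would stand for that year's slice of the 5-minute data.

```python
def forward_chaining_splits(blocks):
    """Yield (train, test) pairs in which training uses only blocks
    that come before the test block (hypothetical helper)."""
    for i in range(1, len(blocks)):
        # Train on everything strictly before block i, test on block i.
        yield blocks[:i], blocks[i]


years = [1, 2, 3, 4, 5, 6]  # assumed labels for the six yearly blocks
for fold, (train, test) in enumerate(forward_chaining_splits(years), start=1):
    print(f"fold {fold} : training {train}, test [{test}]")
```

With six yearly blocks this gives five folds, one fewer than plain 6-fold cross-validation, because the first year never serves as a test set.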