How to split a dataset for time-series prediction

Tags: cross-validation, partitioning

I have historical daily sales data from a bakery covering three years. Now I want to build a model to predict future sales, using features such as weekday and weather variables.

How should I split the dataset for fitting and evaluating the models?

  1. Does it need to be a chronological train/validation/test split?
  2. Would I then do hyperparameter tuning with the train and validation set?
  3. Is (nested) cross validation a bad strategy for a time-series problem?
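To make question 1 concrete, a chronological three-way split might look like the sketch below. The cut-off dates and the synthetic sales series are assumptions for illustration only, not a recommendation for how to size the three sets:

```python
import pandas as pd

# Hypothetical daily sales over three years (values are made up).
dates = pd.date_range("2012-01-01", "2014-12-31", freq="D")
df = pd.DataFrame({"date": dates, "sales": range(len(dates))})

# Chronological split: oldest data for training, most recent for testing.
train = df[df["date"] < "2014-01-01"]                                   # years 1-2
val   = df[(df["date"] >= "2014-01-01") & (df["date"] < "2014-07-01")]  # first half of year 3
test  = df[df["date"] >= "2014-07-01"]                                  # second half of year 3

# The three sets must not overlap in time.
assert train["date"].max() < val["date"].min() < test["date"].min()
```

Hyperparameter tuning (question 2) would then use only `train` and `val`, with `test` held back for a single final evaluation.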

EDIT

Here are some links I came across after following the URL suggested by @ene100:

  • Rob Hyndman describing "rolling forecasting origin" in theory and in practice (with R code)
  • other terms for rolling forecasting origin are "walk forward optimization" (here or here), "rolling horizon" or "moving origin"
  • it seems that these techniques won’t be integrated into scikit-learn in the near future, because “the demand for and seminality of these techniques is unclear” (stated here).
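Since scikit-learn did not offer this out of the box at the time, a rolling forecasting origin is easy to sketch by hand. The function below is a minimal illustration (the name `rolling_origin_splits` and the parameters are my own, not from any library): the training window grows by one observation per fold, and each fold is evaluated on the `horizon` points immediately after the forecasting origin.

```python
def rolling_origin_splits(n_obs, min_train, horizon=1):
    """Yield (train_indices, test_indices) pairs for a rolling
    forecasting origin: train on everything up to the origin,
    test on the next `horizon` observations."""
    for origin in range(min_train, n_obs - horizon + 1):
        yield list(range(origin)), list(range(origin, origin + horizon))

# Toy daily series; in practice this would be the bakery's sales.
series = [10, 12, 9, 14, 11, 13, 15]
for train_idx, test_idx in rolling_origin_splits(len(series), min_train=4):
    pass  # fit on series[i] for i in train_idx, evaluate on test_idx
```

Averaging the per-fold errors gives an out-of-sample estimate that never uses future data to predict the past.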

And this is another suggestion for time-series cross validation.

Best Answer

This link from Rob Hyndman's blog has some info that may be useful: http://robjhyndman.com/hyndsight/crossvalidation/

In my experience, splitting the data into chronological sets (year 1, year 2, etc.) and checking for parameter stability over time is very useful for building something robust. Furthermore, if your data are seasonal, or have another obvious way to split into groups (e.g. geographic regions), then checking for parameter stability in those sub-groups can also help determine how robust the model will be, and whether it makes sense to fit separate models for separate categories of data.
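One way to sketch that stability check is to fit the same simple model separately on each chronological subset and compare the estimated coefficients. The data and model below are synthetic stand-ins (ordinary least squares on made-up weekday/weather features), just to show the mechanic:

```python
import numpy as np

rng = np.random.default_rng(0)
coefs = []
for year in range(3):
    # Synthetic year of data: 365 days, two features, fixed true coefficients.
    X = rng.normal(size=(365, 2))            # e.g. weekday / weather features
    y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=365)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    coefs.append(beta)

# Per-coefficient range across the three yearly fits; small spread
# suggests the parameters are stable over time.
spread = np.ptp(np.vstack(coefs), axis=0)
```

If the yearly estimates diverge noticeably, that is a hint the relationship is drifting and a single model fit on all three years may not generalize.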

I think that statistical tests can be useful but the end result should also pass the "smell test".
