Solved – Train/Test Splitting for Time Series

cross-validation, time series

I'm doing product demand forecasting using 3 years' worth of daily sales data. I've built a few models and now I want to test them. Based on Rob Hyndman's book and this resource, expanding-window walk-forward cross-validation is the gold standard for evaluating models in a time series context.

My question is: what's the most appropriate split? Should I train initially on the first year (365 days) and walk forward through the remaining 2 years (730 days)? I expect the forecasting error to decrease as the training set grows, meaning the error will be large while the training set is still small. Is it valid instead to train initially on the first two years and run the walk-forward validation on the last year?
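For concreteness, here's roughly the scheme I have in mind, sketched with scikit-learn's TimeSeriesSplit (the data, fold count, and fold size are placeholders, not a claim about my actual setup):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.random.rand(1095)                 # 3 years of daily demand (placeholder data)
X = np.arange(len(y)).reshape(-1, 1)     # dummy feature: day index

# Each fold evaluates on the next 30 days; the training window
# expands by 30 days per fold and always precedes the test block.
tscv = TimeSeriesSplit(n_splits=10, test_size=30)
for train_idx, test_idx in tscv.split(X):
    print(f"train: days 0-{train_idx[-1]}, "
          f"test: days {test_idx[0]}-{test_idx[-1]}")
```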

Thanks and any insight would be appreciated.

Best Answer

Product demand data usually has yearly seasonality. Training on the first year alone is not sufficient, as your model won't be able to capture any yearly seasonality or long-term trends. Most algorithms require at least 2 years of data for this reason (more would be better, but that's not always available for retail demand forecasting data).

At the same time, you want to make sure that all of the seasonalities are present in your test set as well, so the optimal split in your case is 2 years of training and 1 year of testing.
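A minimal sketch of that split as a walk-forward loop, with synthetic data and a naive last-value forecast standing in for your actual models:

```python
import numpy as np

y = np.random.rand(1095)   # 3 years of daily demand (placeholder data)
initial_train = 730        # train initially on the first 2 years

errors = []
for t in range(initial_train, len(y)):   # walk forward through the final year
    train = y[:t]                        # expanding window: all history up to day t
    forecast = train[-1]                 # placeholder model: naive last-value forecast
    errors.append(abs(y[t] - forecast))  # one-step-ahead absolute error

print(f"walk-forward MAE over the final year: {np.mean(errors):.3f}")
```

Refitting your real model at every step can be expensive; in practice it's common to refit weekly or monthly and only update the forecast origin daily.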