Solved – 10-fold cross validation for forecasting time series with explanatory data

cross-validation, forecasting, machine learning, time series

I saw that this question was asked some years ago here, but I wasn't satisfied with the answers, so I'm asking it again. Are there theoretical foundations for not doing k-fold cross-validation with time series? I've read that it makes more sense not to use forward-looking data to predict past data, but if you expect the dependencies to have evolved over time, you should have modelled that, so it is not shocking to mix all the data when tuning the algorithm and use the end of the window for testing (for presentation purposes and plots).

Moreover, most approaches don't take the time dependency into account, so it seems artificial to use rolling windows (local methods use distances between regressors in the sense of the metric, not in a temporal sense).
I was working with this paper, which doesn't even raise the question.

The only point I would underline is that if you use lags of the target and lags of the regressors, the observations in the training set will be correlated, which may be more problematic for some methods than for others, right?
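To make the correlation concrete, here is a minimal sketch (names and shapes are my own, not from the question) of building a lagged design matrix; consecutive rows share all but one of their entries, which is exactly why nearby training observations are not independent:

```python
import numpy as np

def make_lag_matrix(y, n_lags):
    """Build a design matrix where row t holds y[t-1], ..., y[t-n_lags].

    Consecutive rows share n_lags - 1 values, so neighboring training
    observations are strongly correlated by construction.
    """
    rows = [y[i:i + n_lags][::-1] for i in range(len(y) - n_lags)]
    X = np.asarray(rows)          # shape: (len(y) - n_lags, n_lags)
    target = y[n_lags:]           # value each row is meant to predict
    return X, target

y = np.arange(10.0)
X, target = make_lag_matrix(y, n_lags=3)
print(X[0])  # lags of y[3]: [2. 1. 0.]
print(X[1])  # lags of y[4]: [3. 2. 1.] -- overlaps X[0] in two of three entries
```

Random k-fold shuffling can therefore place nearly identical rows in both the training and test folds, which is the leakage concern behind blocked schemes.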

Best Answer

I am not an expert in the field, but I believe the paper "On the use of cross-validation for time series predictor evaluation" by Christoph Bergmeir can help answer your question. It addresses the question of whether specialized evaluation techniques for time series provide an advantage in handling the dependencies and time-evolving effects that may occur. The authors do not provide a strong theoretical background, but they performed several experiments and concluded that no practical problems with standard cross-validation could be found. However, they suggest using blocked cross-validation, together with an adequate check for stationarity, since it makes full use of all available information for both training and testing, thus yielding a robust error estimate.
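For illustration, here is a minimal sketch of blocked cross-validation (my own simplified version, not the paper's exact procedure): the series is cut into contiguous blocks, each block serves once as the test fold, and an optional `gap` of observations around the test block is dropped from training to reduce leakage through lagged features:

```python
import numpy as np

def blocked_cv_splits(n_samples, n_folds=5, gap=0):
    """Yield (train_idx, test_idx) pairs for blocked cross-validation.

    Each of the n_folds contiguous blocks is used once as the test set;
    observations within `gap` positions of the test block are excluded
    from training to limit leakage from overlapping lag features.
    """
    indices = np.arange(n_samples)
    for test_idx in np.array_split(indices, n_folds):
        lo, hi = test_idx[0] - gap, test_idx[-1] + gap
        train_idx = indices[(indices < lo) | (indices > hi)]
        yield train_idx, test_idx

# Usage: 100 observations, 5 folds, gap of 2 for lag-2 features
for train_idx, test_idx in blocked_cv_splits(100, n_folds=5, gap=2):
    print(len(train_idx), test_idx[0], test_idx[-1])
```

Unlike a single rolling-origin split, every observation appears in a test fold exactly once, which is what the answer means by making full use of the data.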
