Time Series – Re-training on the Entire Time Series After Cross-Validation

Tags: cross-validation, forecasting, time series

In his book, Hands-On Time Series Analysis with R, Rami Krispin writes: 'Typically, once we have trained and tested the model using the training and testing partitions, we will retrain the model with all of the data (or at least the most recent observation in the chronological order)'.

My question is this:

In time series cross-validation methods such as the expanding and sliding window, the most recent observations fall within the test set, because the chronological order must be preserved.

Intuitively, the most recent observations can be the most influential predictors, although this is not always true. But, for the cases where the most recent observations are predictive, aren't we missing the information from those observations by not using them for training? If so, what are your thoughts on measuring model performance with one of the time series cross-validation methods first and then re-training on the entire data set for the final model, as Rami suggests?
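To make the setup concrete, here is a minimal sketch of expanding-window cross-validation followed by a refit on all data. It assumes a univariate series of 120 points, simple lag features, and a scikit-learn linear model; the helper name make_lag_features and the choice of 12 lags are illustrative, not taken from the question or the book.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=120))        # stand-in for ~10 years of monthly data

def make_lag_features(y, n_lags=12):
    """Build a simple lag matrix: predict y[t] from y[t-1] ... y[t-n_lags]."""
    X = np.column_stack([y[i:len(y) - n_lags + i] for i in range(n_lags)])
    return X, y[n_lags:]

X, target = make_lag_features(y)

# Expanding window: each fold trains on everything before its test fold,
# so the most recent observations only ever appear in the *test* folds.
cv = TimeSeriesSplit(n_splits=5)
scores = []
for train_idx, test_idx in cv.split(X):
    model = LinearRegression().fit(X[train_idx], target[train_idx])
    scores.append(mean_absolute_error(target[test_idx], model.predict(X[test_idx])))
print("CV MAE per fold:", np.round(scores, 3))

# The retraining step the quote describes: refit the chosen model on all data
final_model = LinearRegression().fit(X, target)
```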

On the other hand, when the entire data set is used for training, there is a danger of overfitting, and no validation set remains.

Also, let's say I put aside the last 10% of the time series as a test set for out-of-sample predictions. The remaining 90% is then the total train-validation set. When using a cross-validation method, the validation set (say another 10%) must also be the most recent chronologically, so at most the remaining 80% is available for model training and parameter tuning. After the cross-validation step, I have a single chosen model with determined hyperparameters. Next, I retrain this model on the entire 90% and obtain new parameter estimates (based on the 90%), but the model type itself was still selected using the first 80% of the data. For example, if I am looking at 10 years of historical data, my model is selected based on the first 8 years, and that makes me wonder as well.
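For concreteness, here is a sketch of the split layout that paragraph describes, again assuming 120 points (10 "years" of monthly data); the exact index boundaries and the choice of scikit-learn's TimeSeriesSplit are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n = 120
test_start = int(n * 0.9)                 # last 10% held out for out-of-sample testing
train_val_idx = np.arange(test_start)     # first 90%: model selection + final refit
test_idx = np.arange(test_start, n)       # final 10%: untouched until the very end

# Within the 90%, expanding-window CV keeps each validation fold strictly after
# its training window. On the last fold this reproduces roughly the 80% train /
# 10% validation / 10% test layout described above.
cv = TimeSeriesSplit(n_splits=5, test_size=len(train_val_idx) // 10)
for fold, (tr, val) in enumerate(cv.split(train_val_idx)):
    print(f"fold {fold}: train 0..{tr[-1]}, validate {val[0]}..{val[-1]}")

# After choosing the model and hyperparameters, refit on the whole 90%
# (train_val_idx) and only then score once on test_idx.
```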

Any thoughts? Thanks.

Best Answer

This issue is indeed a bit of a problem in time series forecasting. (And more generally in prediction, if your test sample can be suspected of differing systematically from the training and validation samples.) I would make two points here.

First, whether the most recent data is really "most influential" is very much open to debate, and will depend heavily on your use case. If you are forecasting demand for a new product, yes. (But then you would probably use specialized models, like the Bass diffusion model, and cross-train them on other products - not choose the model based on a holdout set of the focal time series.) But when my forecast consumers ask me to "put more emphasis on recent observations" or similar, I always push back unless they can provide an actual argument for why the data generating process should have evolved or changed recently. (This prior of mine may reflect that I work in a very mature industry.)

Second, if there are actual reasons to suppose the DGP has changed, you should indeed treat the time series differently, and not rely on a holdout validation sample. For instance, you might use specialized models, like the Bass diffusion model mentioned above. Or you might use only the most recent data, treat it as a short time series, and use an appropriate method. Or you could fit one model to the entire series and another only to the last observations, and take the average of the two forecasts.
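A minimal sketch of that last idea follows, assuming simple exponential smoothing from statsmodels as the base model and a 24-observation "recent" window; both choices are illustrative, not something the answer prescribes.

```python
import numpy as np
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=120))              # illustrative series

h = 12                                           # forecast horizon
full_fit = SimpleExpSmoothing(y).fit()           # model fitted to the entire history
recent_fit = SimpleExpSmoothing(y[-24:]).fit()   # model fitted to recent data only

# Simple average of the two forecasts
combined = (full_fit.forecast(h) + recent_fit.forecast(h)) / 2
print(combined)
```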

Bottom line: you really need to think about the time series you are forecasting. (Or trust in an automatic system and live with potentially lower accuracy - that may well be a rational use of your time.)