Time Series – How to Cross-Validate Time-Series Analysis

Tags: cross-validation, r, time-series

I've been using the caret package in R to build predictive models for classification and regression. Caret provides a unified interface for tuning model hyper-parameters by cross-validation or bootstrapping. For example, if you are building a simple nearest-neighbors model for classification, how many neighbors should you use? 2? 10? 100? Caret helps you answer this question by resampling your data, trying different parameter values, and then aggregating the results to decide which yields the best predictive accuracy.
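As a rough illustration of that tuning loop, here is a minimal sketch; the built-in iris data and the candidate values of k are assumptions for the example, not part of my actual problem:

```r
# Minimal sketch of caret's tuning workflow: try several values of k for a
# k-nearest-neighbors classifier and keep the one with the best
# cross-validated accuracy. The iris data set is just a stand-in example.
library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation

fit <- train(Species ~ ., data = iris,
             method = "knn",
             trControl = ctrl,
             tuneGrid = data.frame(k = c(2, 10, 100)))

fit           # cross-validated accuracy for each candidate k
fit$bestTune  # the value of k that caret selected
```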

I like this approach because it provides a robust methodology for choosing model hyper-parameters, and once you've chosen the final hyper-parameters, it gives a cross-validated estimate of how 'good' the model is, using accuracy for classification models and RMSE for regression models.

I now have some time-series data that I want to build a regression model for, probably using a random forest. What is a good technique to assess the predictive accuracy of my model, given the nature of the data? If random forests don't really apply to time series data, what's the best way to build an accurate ensemble model for time series analysis?

Best Answer

The "classical" k-times cross-validation technique is based on the fact that each sample in the available data set is used (k-1)-times to train a model and 1 time to test it. Since it is very important to validate time series models on "future" data, this approach will not contribute to the stability of the model.

One important property of many (most?) time series is the correlation between adjacent values. As pointed out by IrishStat, if you use previous readings as the independent variables of your candidate model, this correlation (or lack of independence) plays a significant role and is another reason why k-fold cross-validation is not a good idea.
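You can check this lack of independence directly on your own series; a quick sketch, using a synthetic AR(1) series purely as a placeholder:

```r
# Quick check of serial correlation: significant spikes in the sample
# autocorrelation function at low lags mean adjacent observations are not
# independent, which is the property that undermines ordinary k-fold CV.
set.seed(1)
y <- as.numeric(arima.sim(model = list(ar = 0.8), n = 500))  # synthetic AR(1) series

acf(y, lag.max = 30, main = "Sample ACF of y")
Box.test(y, lag = 10, type = "Ljung-Box")   # formal test of no autocorrelation
```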

One way to overcome this problem is to "oversample" the data and then decorrelate it. If the decorrelation is successful, using cross-validation on the time series becomes less problematic. It will not, however, solve the issue of validating the model on future data.

Clarifications

By validating the model on future data I mean constructing the model, waiting for new data that was not available during model construction, testing, and fine-tuning, and then validating the model on that new data.

By oversampling the data I mean collecting time-series data at a frequency much higher than practically needed. For example: sampling stock prices every 5 seconds when you are really interested in hourly changes. Here, when I say "sampling" I don't mean "interpolating", "estimating", etc. If the data cannot be measured at higher frequency, this technique is meaningless.
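A minimal sketch of the subsequent thinning step, assuming the series really was measured every 5 seconds; the random-walk prices below are just a placeholder:

```r
# Thin a high-frequency series down to the frequency of interest.
# Keeping every 720th 5-second reading yields an hourly series
# (3600 s / 5 s = 720 readings per hour). This selects actual
# measurements; nothing is interpolated or estimated.
set.seed(7)
prices_5s <- 100 + cumsum(rnorm(720 * 24, sd = 0.01))  # placeholder: a day of 5-second prices
prices_hourly <- prices_5s[seq(720, length(prices_5s), by = 720)]

acf(diff(prices_hourly))  # check how much serial correlation remains
```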