Solved – Sliding window validation for time series

data mining · rapidminer · time series

I have a broad question about sliding window validation. Specifically, I am looking at using RapidMiner to predict future values of a financial series from "lagged" values of that series and other covariates. I have been experimenting with the windowing operator in this software and lagging the values to prepare for modeling. What confuses me is the sliding-window training/evaluation process itself; I suspect this is a general procedure rather than anything specific to RapidMiner, which is why I am asking here.

  1. Does anyone have sources to recommend for learning about sliding window processes for building data mining models on time series?

  2. Specifically, when building a model, I think I understand that $k$ instances are used to train a model (e.g., an SVM), and the model's performance is assessed by predicting the next $m$ records. The window is then slid forward some amount, the next $k$ records are used for training, and evaluation is done on the subsequent $m$ records. This continues until the end of the data. (I have sketched this scheme in code below.)

Is my understanding correct?

How is a final model built for use on future data? Is it always re-trained on the last $k$ records, so that only those most recent $k$ records ever go into the final model?
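
To make sure I am describing the same scheme, here is a minimal sketch of my understanding in Python with scikit-learn (RapidMiner's windowing operator would handle this internally; `k`, `m`, and `step` are the window sizes described above, and `X`, `y` are hypothetical lagged features and targets):

```python
import numpy as np
from sklearn.svm import SVR

def sliding_window_validation(X, y, k, m, step):
    """Train on k rows, test on the next m rows, then slide forward by `step`."""
    window_mse = []
    start = 0
    while start + k + m <= len(y):
        train = slice(start, start + k)          # k records for training
        test = slice(start + k, start + k + m)   # next m records for evaluation
        model = SVR().fit(X[train], y[train])
        pred = model.predict(X[test])
        window_mse.append(np.mean((pred - y[test]) ** 2))
        start += step                            # slide the window forward
    return float(np.mean(window_mse))
```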

Best Answer

Your understanding of sliding-window analysis is generally correct. You may find it helpful to separate the model-validation process from the actual forecasting. In validation, you use $k$ instances to train a model that predicts "one step" forward. Make sure each of your $k$ instances uses only information available at that particular time. This can be subtle, because it is easy to accidentally peek into the future and pollute your out-of-sample test.

For example, you might accidentally run feature selection on the entire time-series history and then use those selected features to test the model at every step. This is cheating, and it will overestimate your accuracy. The pitfall is discussed in The Elements of Statistical Learning, although outside the sliding-window time-series context.
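
As a sketch of the honest version (using scikit-learn's `SelectKBest` as a stand-in for whatever feature selection you actually use), the selection must happen inside each training window:

```python
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.svm import SVR

def fit_window(X_train, y_train, n_features=5):
    """Choose features from the training window only, so the test window
    never influences which features the model sees."""
    selector = SelectKBest(f_regression, k=n_features).fit(X_train, y_train)
    model = SVR().fit(selector.transform(X_train), y_train)
    return selector, model

# The leaky version would fit SelectKBest once on the full history and
# reuse those features in every window -- an easy way to overestimate
# out-of-sample accuracy.
```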

It is also easy to pollute the test with future information when some of your independent variables are asset returns. Say one of my features is the return on an asset from $t=21$ days to $t=28$ days, and I use it to test at $t=21$ days. That return is not known until $t=28$, so I have again polluted the out-of-sample test. Instead, I would want to train with instances up to $t=21$ days and test one step ahead at $t=28$ days.
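
One way to keep this alignment straight is to timestamp every quantity at the moment it becomes known. A pandas sketch with a hypothetical daily `prices` series (the 7-day gap mirrors the $t=21$ to $t=28$ example):

```python
import pandas as pd

def make_frame(prices: pd.Series) -> pd.DataFrame:
    past_ret = prices.pct_change(7)             # trailing 7-day return, known at t
    future_ret = prices.shift(-7) / prices - 1  # return from t to t+7, known only at t+7
    # future_ret is a legitimate *target* at time t, but using it (or any
    # other forward-looking quantity) as a *feature* at t pollutes the test.
    return pd.DataFrame({"x_trailing_ret": past_ret,
                         "y_forward_ret": future_ret}).dropna()
```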

Once you have validated your model and are happy with the parameters and feature selection, you typically re-train on all of your data and forecast into the actual future.
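
In code, that last step is just a refit on the full history (a sketch continuing the hypothetical names above; `x_next` is the feature row for the period you actually want to forecast):

```python
from sklearn.svm import SVR

def final_forecast(X_all, y_all, x_next):
    """Refit the validated model on every instance available today,
    then predict the genuinely unseen next period."""
    model = SVR().fit(X_all, y_all)
    return model.predict(x_next)
```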