Solved – Is autocorrelation in a supervised learning dataset a problem?

autocorrelation, nonlinear, random forest, supervised learning, time series

Imagine the following problem. I have weekly snapshots of price data for K items, as well as of various features/predictors. I want to predict how much each price will change 2 years from now.

I assemble my dataset as follows: each row consists of the features of one item in one week, and the output variable is the forward 2-year price return. The date of the observation is not in my dataset – I only use it to split the data into training and validation sets in cross-validation, where each validation period is 1 year and I discard 2 years of data before and after it to prevent data snooping.
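To make that splitting scheme concrete, here is a minimal sketch in Python, assuming a pandas DataFrame `df` with a `date` column; the column name and the `purged_splits` helper are my own illustration, not a library API.

```python
import pandas as pd

def purged_splits(df, val_years, purge=pd.DateOffset(years=2)):
    """Yield (train_idx, val_idx) pairs: each validation window is one
    calendar year, and training rows within `purge` of that window on
    either side are dropped, since their forward 2-year returns would
    overlap the validation responses."""
    for year in val_years:
        val_start = pd.Timestamp(year=year, month=1, day=1)
        val_end = val_start + pd.DateOffset(years=1)
        val_mask = (df["date"] >= val_start) & (df["date"] < val_end)
        # Purge: keep only training rows at least `purge` away from the window.
        train_mask = (df["date"] < val_start - purge) | (df["date"] >= val_end + purge)
        yield df.index[train_mask], df.index[val_mask]

# Usage: for train_idx, val_idx in purged_splits(df, range(2010, 2015)): ...
```

The point of the purge is that a forward return observed just before the validation window covers mostly the same period as the validation responses, so leaving it in the training set would leak information across the split.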

Clearly, the samples from two consecutive weeks for the same item (and even for different items) will be highly correlated, both in the features and in the response variable (the forward 2-year windows largely overlap, so the returns will be very similar). What potential problems can this cause for supervised learning approaches, e.g. random forests or gradient boosted trees?

My thoughts are:

  1. The effective size of the dataset will be smaller than expected. I.e. my dataset of, say, 100'000 observations will behave more like a dataset of 100'000 / (52*2) ~= 1000 observations, because that is roughly the number of non-overlapping samples, i.e. samples whose responses are not autocorrelated (see the simulation sketch after this list). That will significantly limit the complexity of the models I can fit to the data, i.e. I will have significant overfitting problems and much poorer results than expected.
  2. Because the features of each item barely move from one week to the next, my dataset will cover the feature space far worse than its nominal size suggests, again reducing the "effective" size of the dataset.
  3. Using only 1 year of data for validation in cross-validation will result in high variance of the cross-validation estimates, because, once again, the effective number of samples in the validation set will be ~K rather than 52*K.
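A quick way to check point 1 is to simulate it. The following sketch (my own illustration, with arbitrary parameters) builds a weekly random-walk log price, computes overlapping 2-year forward returns, and estimates the effective sample size from their autocorrelations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_weeks, horizon = 1000, 104            # ~19 years of weekly data, 2-year horizon
log_price = np.cumsum(rng.normal(0, 0.02, n_weeks))
fwd_ret = log_price[horizon:] - log_price[:-horizon]   # overlapping forward returns

n = len(fwd_ret)
dm = fwd_ret - fwd_ret.mean()

def acf(x, k):
    # Simple (biased) autocorrelation estimate at lag k
    return np.dot(x[:-k], x[k:]) / np.dot(x, x)

# Effective sample size: n / (1 + 2 * sum of positive autocorrelations)
rhos = [acf(dm, k) for k in range(1, horizon)]
n_eff = n / (1 + 2 * sum(r for r in rhos if r > 0))
print(f"nominal n = {n}, lag-1 acf = {acf(dm, 1):.3f}, effective n ~= {n_eff:.0f}")
```

With a 104-week horizon, consecutive forward returns share 103 of their 104 weeks, so the lag-1 autocorrelation comes out near 103/104 ~= 0.99, and the effective sample size typically lands close to n divided by the horizon – exactly the back-of-the-envelope division in point 1.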

Are these valid concerns? If so, does it mean that with K ~= 100 I would need hundreds, if not thousands, of years of data to train a reasonably complex non-linear model on hundreds of features, e.g. using random forests or gradient boosted trees? Or am I being over-pessimistic, and is my "effective dataset size" argument above nonsensical?

Best Answer

You touch on an issue that has a parallel in the econometric literature: the long-horizon predictability problem. While stock markets and currencies are difficult to predict in the short term, some econometric studies have claimed that long-term returns are "much more predictable" using covariates like dividend yields.

Well, it turns out there is a subtle flaw in these models. Since consecutive responses and predictors cover overlapping periods, they are highly autocorrelated across horizons, and the data points are not independent – so standard inference, which assumes independence, drastically overstates the evidence for predictability.
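To see the flaw in action, here is a small Monte Carlo sketch (my own illustration of the mechanism these papers formalize, not code from any of them): the returns are pure noise, independent of a persistent AR(1) predictor, yet a naive OLS regression of overlapping 2-year returns on that predictor rejects the no-predictability null far more often than the nominal 5%.

```python
import numpy as np

rng = np.random.default_rng(1)
T, H, n_sims = 600, 104, 500            # weeks, horizon, simulations
rejections = 0
for _ in range(n_sims):
    # Persistent predictor (e.g. a yield-like AR(1)) and independent noise returns
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = 0.99 * x[t - 1] + rng.normal()
    r = rng.normal(size=T)
    # Overlapping H-period forward returns: y[t] = sum of r[t+1 .. t+H]
    y = np.convolve(r, np.ones(H), mode="valid")[1:]
    xx = x[: len(y)]
    # OLS slope with the naive (iid) standard error
    xc, yc = xx - xx.mean(), y - y.mean()
    beta = xc @ yc / (xc @ xc)
    resid = yc - beta * xc
    se = np.sqrt(resid @ resid / (len(y) - 2) / (xc @ xc))
    rejections += abs(beta / se) > 1.96
print(f"false rejection rate at the 5% level: {rejections / n_sims:.0%}")
```

The rejection rate comes out far above 5% even though there is no predictability at all, which is why the papers below focus on corrected inference for overlapping observations.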

Here are a few papers I could find in my library. The Berkowitz paper is probably the most devastating on the subject.

A study that shows long-horizon predictability:

Mark, N. C., & Choi, D. Y. (1997). Real exchange-rate prediction over long horizons. Journal of International Economics, 43(1), 29-60.

Criticism and statistical tests:

Berkowitz, J., & Giorgianni, L. (2001). Long-horizon exchange rate predictability?. The Review of Economics and Statistics, 83(1), 81-91.

Boudoukh, J., Richardson, M., & Whitelaw, R. F. (2008). The myth of long-horizon predictability. The Review of Financial Studies, 21(4), 1577-1605.

Richardson, M., & Smith, T. (1991). Tests of financial models in the presence of overlapping observations. The Review of Financial Studies, 4(2), 227-254.
