Solved – Backtesting/cross-validation for time-series and prediction intervals

Tags: cross-validation, prediction-interval, time-series

Suppose I carry out the following exercise with my trusty statistics software:

  • Fit some time-series model to the data $y_1,\dots,y_t$ and calculate $\hat{y}_{t+1}$, the forecast of the next observation, and the error $e_{t+1}^*=y_{t+1}-\hat{y}_{t+1}$ for that forecast observation.
  • Repeat the previous step for $t=m,\dots,n-1$, where $m$ is the minimum number of observations needed for fitting my model.
  • Plot the distribution of the errors $e_{m+1}^*,\dots,e_{n}^*$, or calculate its percentiles (see the sketch just after this list).
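
A minimal sketch of this expanding-window backtest, where a least-squares AR(1) fit stands in for "some time-series model" (the simulated series, the coefficient 0.6, and the minimum window $m=20$ are illustrative assumptions, not from the question):

```python
# Expanding-window, one-step-ahead backtest.  A least-squares AR(1) fit
# stands in for "some time-series model"; the data below are simulated.
import numpy as np

def backtest_errors(y, m=20):
    """One-step-ahead forecast errors e*_{t+1} for t = m, ..., n-2 (0-indexed)."""
    errors = []
    for t in range(m, len(y) - 1):
        train = y[: t + 1]                              # y_0, ..., y_t
        X = np.column_stack([np.ones(t), train[:-1]])   # fit y_s ~ c + phi * y_{s-1}
        c, phi = np.linalg.lstsq(X, train[1:], rcond=None)[0]
        y_hat = c + phi * train[-1]                     # forecast of y_{t+1}
        errors.append(y[t + 1] - y_hat)                 # e*_{t+1}
    return np.array(errors)

# Illustrative data: an AR(1) with coefficient 0.6 and unit-variance noise
rng = np.random.default_rng(0)
y = np.zeros(300)
for s in range(1, 300):
    y[s] = 0.6 * y[s - 1] + rng.normal()

e = backtest_errors(y, m=20)
print(np.percentile(e, [2.5, 50, 97.5]))                # percentiles of the error distribution
```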

What relationship does that distribution have to the analytical prediction interval for $\hat y_{t+1}$?

My intuition is that the distribution of the errors from this iterative CV process does not tell you very much about the variability of the prediction made with the final version of the model. As the model is trained on more data, the errors will tend to shrink with each step, so the large errors will come from early versions of the model and the small errors from later versions. The final version of the model is more like the small-error late versions, so it makes little sense to treat the early large errors as draws from that final model's error distribution. Moreover, even if the model does not improve as it is fed more data, many time series models produce analytical prediction intervals, which tell you whether the difference between an actual observation and its forecast is an outlier or not.
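
For instance, a fitted ARIMA model in statsmodels reports such an analytical interval directly. A hedged sketch, where the AR(1) order, the simulated series, and the 95% level are illustrative choices rather than anything from the post:

```python
# Analytical one-step-ahead prediction interval from a fitted model,
# sketched with statsmodels; the model order and data are illustrative.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
y = np.zeros(300)
for s in range(1, 300):
    y[s] = 0.6 * y[s - 1] + rng.normal()

res = ARIMA(y, order=(1, 0, 0)).fit()
fc = res.get_forecast(steps=1)
print(fc.predicted_mean)                 # point forecast \hat{y}_{n+1}
print(fc.conf_int(alpha=0.05))           # analytical 95% prediction interval
```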

Best Answer

Your intuition seems right to me. Assume, for example, that $$ y_{n + 1} = \sum_{k = 1}^m \theta_k y_{n + 1 - k} + \epsilon_{n + 1} $$ where $\epsilon_{n + 1} \sim N(0, 1)$. If you are choosing to fit a model with at least $x$ observations (your minimum $m$; here $m$ denotes the order of the model), then I am assuming you want the estimated model parameters $\hat{\theta}_1, \dots, \hat{\theta}_m$ to be stable, so that they do not change much as your training set grows. In this case, $\theta_i - \hat{\theta}_i \approx w_i$ for some $w_i$, so long as the training set is at least as large as $x$.

We now have \begin{eqnarray} y_{n + 1} - \hat{y}_{n + 1} &=& \sum_{k = 1}^m \left(\theta_k - \hat{\theta}_k\right) y_{n + 1 - k} + \epsilon_{n + 1} \\ &\approx& \sum_{k = 1}^m w_k y_{n + 1 - k} + \epsilon_{n + 1}. \end{eqnarray} Moreover, given the observations $y_1, \dots, y_n$, the innovation $\epsilon_{n + 1}$ is independent of $\epsilon_1, \epsilon_2, \dots, \epsilon_n$. In particular, the prediction interval should be generated by $$ N\left(\hat{y}_{n + 1} + \sum_{k = 1}^m w_k y_{n + 1 - k},\; 1 \right). $$ Because of this, there does not appear to be a direct link between the prediction interval of $\hat{y}_{n+1}$ and the prior errors.

I would like to mention, however, that if the errors are correlated, $$ E[\epsilon_i \epsilon_j \mid y_1, \dots, y_j] \neq 0, \qquad i < j, $$ then it should not be hard to show that the estimated RMSE produced by the cross-validation procedure you describe will be biased. This makes the procedure, as a form of CV for model selection, seem unreliable to me.
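
A small simulation may make the decomposition concrete: the backtest error is approximately $\sum_k w_k y_{n+1-k} + \epsilon_{n+1}$, i.e. the innovation plus a parameter-estimation term, so its spread generally differs from that of $\epsilon_{n+1}$ alone, especially when the training windows are small. Everything below (an AR(1) with $\phi = 0.6$, minimum window 15) is an illustrative assumption, not taken from the answer:

```python
# The answer's decomposition in miniature: with a known data-generating
# process, compare one-step errors computed with the TRUE coefficient
# (these are exactly the innovations eps) against errors from coefficients
# re-estimated on each expanding window (innovation + estimation error).
import numpy as np

rng = np.random.default_rng(2)
n, phi, m = 400, 0.6, 15
y = np.zeros(n)
for s in range(1, n):
    y[s] = phi * y[s - 1] + rng.normal()                  # eps ~ N(0, 1)

err_true, err_est = [], []
for t in range(m, n - 1):
    train = y[: t + 1]
    X = np.column_stack([np.ones(t), train[:-1]])
    c_hat, phi_hat = np.linalg.lstsq(X, train[1:], rcond=None)[0]
    err_true.append(y[t + 1] - phi * y[t])                 # pure innovation
    err_est.append(y[t + 1] - (c_hat + phi_hat * y[t]))    # innovation + estimation error

print("s.d. of innovation-only errors:", np.std(err_true))
print("s.d. of backtest errors       :", np.std(err_est))
```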
