Time series model selection: AIC vs. out-of-sample SSE and their equivalence

aic, arima, cross-validation, model-selection, time-series

AIC is frequently recommended as a criterion for comparing models for time series forecasting. See, for example, this in the context of dynamic regression models:

The AIC can be calculated for the final model, and this value can be
used to determine the best predictors. That is, the procedure should
be repeated for all subsets of predictors to be considered, and the
model with the lowest AICc value selected.

Why not compare the models based on their out-of-sample performance instead (e.g. choose the model with the lowest SSE in out-of-sample forecasting)? I've been reading several textbooks and websites on time series forecasting and haven't found this discussion. The closest I got was this blog entry, Facts and fallacies of the AIC:

The AIC is not really an “in-sample” measure. Yes, it is computed
using the training data. But asymptotically, minimizing the AIC is
equivalent to minimizing the leave-one-out cross-validation MSE for
cross-sectional data, and equivalent to minimizing the out-of-sample
one-step forecast MSE for time series models. This property is what
makes it such an attractive criterion for use in selecting models for
forecasting.

In an example I've been working on (I couldn't post the plots here, though; I need more reputation on this site), I tried both approaches, and most of the time AIC and out-of-sample SSE do not select the same model. The procedure I used was as follows:

  1. I divided the data into training and test samples (at an arbitrary point; there is a question about this below).
  2. I estimated competing models (ARIMA with external regressors, changing ARIMA parameters and the regressors) using the training sample (first 230 periods; all models have the same number of observations so AIC is comparable).
  3. Then, I forecasted the series for the same periods as the test sample (periods 231-260).
  4. For each model, I calculated the simple SSE $SSE=\sum_{t=231}^{260}(\widehat{y_t}-y_t)^2$, where $y_t$ is the observed value of the series (test sample) and $\widehat{y_t}$ is the value forecast by the model.
  5. I compared the model indicated by AIC (computed using the training data) with the model with the lowest out-of-sample SSE. Most of the time the selected models are different (and, at least visually, those selected by SSE perform better). A rough sketch of this procedure is given below.
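To make the above concrete, here is a rough sketch of what I did, written in Python with statsmodels' SARIMAX. The data, regressors, and candidate ARIMA orders below are placeholders (I could not post my actual series), so take it as an illustration of the procedure rather than the exact code I ran:

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Placeholder data standing in for the series and external regressors
# (260 periods, 2 regressors) -- not the actual data from my example.
rng = np.random.default_rng(0)
X = rng.normal(size=(260, 2))
y = 0.5 * X[:, 0] + 0.1 * np.cumsum(rng.normal(size=260)) + rng.normal(size=260)

train, test = slice(0, 230), slice(230, 260)          # first 230 periods vs. periods 231-260
candidate_orders = [(1, 0, 0), (0, 0, 1), (1, 0, 1)]  # illustrative ARIMA orders only

results = []
for order in candidate_orders:
    # Estimate the model on the training sample only
    fit = SARIMAX(y[train], exog=X[train], order=order).fit(disp=False)
    # Forecast the 30 test periods, feeding in the (known) future regressors
    fc = fit.forecast(steps=30, exog=X[test])
    sse = float(np.sum((fc - y[test]) ** 2))
    results.append((order, fit.aic, sse))

for order, aic, sse in results:
    print(order, round(aic, 1), round(sse, 1))
# Compare which order has the lowest AIC vs. the lowest out-of-sample SSE.
```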

If someone could explain to me what is going on behind this I would be very grateful. I am clearly not an expert in this. I am just trying to teach myself a little, so please excuse if I overlooked something important in the textbooks I've been reading.

Finally, a question regarding splitting the data into training and test samples for time series. It seems to me this is fundamentally different from doing the same with cross-sectional data. For cross-sectional data you can take two random samples from your whole dataset; for time series this does not make much sense, so you need to pick an arbitrary point at which to split the series into training and test samples. The thing is that the best model usually differs depending on that arbitrary point. Perhaps that is why this approach does not seem to be frequently used. Is this the reason why AIC is preferred for model selection? (Given that "asymptotically, minimizing the AIC is … equivalent to minimizing the out-of-sample one-step forecast MSE for time series models".)

Best Answer

Why not compare the models based on their out-of-sample performance?

Of course, you can do that. I suppose the advantages of AIC are faster computation and less coding (AIC is often reported automatically as part of the model diagnostics, while cross-validation for time series might not be readily available in your favourite software).

I tried both approaches and most of the time AIC and out-of-sample SSE do not yield the same result.

You do not seem to have implemented cross validation properly. First, you split the data only once, whereas you are supposed to split it multiple times. Second, you assessed forecasting performance based on one trial of forecasting multiple different horizons rather than multiple trials of forecasting one fixed horizon. This may be why you got the discrepancy between AIC and cross validation.

When implementing cross validation in a time series setting, you can make use of rolling windows. You would take observations $t$ to $t+m-1$, where $m$ is the window length, and roll $t$ from $1$ to $T-m$, where $T$ is the sample size. You would estimate your model in each rolling window and predict one period ahead (observation $t+m$). You would then collect these predictions and compare them to the actual values. That would give you an out-of-sample measure of forecasting performance from cross validation in a time series setting.
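As a minimal sketch of that scheme, assuming Python with statsmodels' SARIMAX (the function name `rolling_one_step_sse` and the fixed-window handling are just illustrative choices, not the only way to set this up):

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

def rolling_one_step_sse(y, X, order, window):
    """Sum of squared one-step-ahead forecast errors from rolling windows.

    For each t, the model is re-estimated on observations t..t+window-1
    and used to predict observation t+window.
    """
    errors = []
    for t in range(len(y) - window):
        fit = SARIMAX(y[t:t + window], exog=X[t:t + window],
                      order=order).fit(disp=False)
        # One-step-ahead forecast, passing the regressors of the next period
        fc = fit.forecast(steps=1, exog=X[t + window:t + window + 1])
        errors.append(float(fc[0] - y[t + window]))
    return float(np.sum(np.square(errors)))
```

You would compute this for each candidate model and pick the one with the smallest value; note that re-estimating the model in every window can be slow for long series.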

See also Hyndman and Athanasopoulos, "Forecasting: principles and practice", section 2.5 (scroll all the way down), and Bergmeir et al., "A note on the validity of cross-validation for evaluating time series prediction" (2015, working paper).

at least visually, those [models] selected by SSE perform better

It could be that the model residuals did not quite have the assumed distribution, or that the model had some other flaw that invalidated its AIC. That is one argument for preferring out-of-sample forecast accuracy over AIC in model selection.