Time Series – Why AIC or BIC is Commonly Used in Model Selection for Time Series Forecasting

aic, arima, forecasting, model selection, time series

In the scikit-learn documentation, I found the following comment about AIC:

Information-criterion based model selection is very fast, but it relies on a proper estimation of degrees of freedom. The criteria are derived for large samples (asymptotic results) and assume the model is correct, i.e. that the data are actually generated by this model. They also tend to break when the problem is badly conditioned (more features than samples).

My questions are:

  1. Why would AIC break when we have more features than samples?
  2. Why are AIC and BIC commonly used in forecasting models like ARIMA?

Best Answer

What alternatives do we have in model selection for prediction?

  • The main ones are cross validation and information criteria.

Why are the latter attractive in the time series setting?

  • Information criteria are less computationally intensive. You only need to fit the model once to calculate an information criterion, in contrast to most applications of cross validation. Computational efficiency is extra desirable in the time series setting, as many basic time series models (ARMA, GARCH and the like) tend to be rather computationally demanding to estimate (more so than, say, linear regression). A minimal sketch of single-fit selection follows this list.
  • Information criteria also make more efficient use of the data, as the model is estimated on the entire sample rather than just a training subset. This is important in small data sets* and especially in time series settings: we do not want to leave out too much data for testing, because then very little is left for training/estimation. Leave-one-out cross validation (LOOCV), which leaves out only a single observation at a time, works well in a cross-sectional setting, but it is often inapplicable in the time series setting due to the mutual dependence of the observations. The types of validation that do remain applicable (e.g. rolling-origin evaluation, sketched at the end of this answer) are much more data- and computation-costly. For more details, see "AIC versus cross validation in time series: the small sample case".
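To make the single-fit point concrete, here is a minimal sketch of information-criterion-based order selection for an ARIMA-type model. The use of statsmodels (whose `ARIMA` results expose `aic` and `bic` attributes), the simulated AR(1) series, and the candidate grid are all illustrative choices, not anything prescribed in the answer above.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulated AR(1) data, purely for illustration.
rng = np.random.default_rng(0)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.7 * y[t - 1] + rng.normal()

# Each candidate model is fitted exactly once; AIC and BIC
# come out of that single fit.
candidates = [(p, 0, q) for p in range(3) for q in range(3)]
scores = []
for order in candidates:
    res = ARIMA(y, order=order).fit()
    scores.append((order, res.aic, res.bic))

print("best by AIC:", min(scores, key=lambda s: s[1])[0])
print("best by BIC:", min(scores, key=lambda s: s[2])[0])
```

Nine candidate orders cost nine fits in total, which is what makes this approach attractive when each fit is expensive.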

*Information criteria have an asymptotic justification, so their use is not unproblematic in small samples. Nevertheless, a more efficient use of the data is still preferable to a less efficient one: by using the entire sample for estimation you are closer to the asymptotic regime than by using, say, 2/3 of the sample.
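For contrast, the kind of validation that does remain applicable in time series, rolling-origin (expanding-window) one-step-ahead evaluation, is sketched below. The helper name and the minimum training size are hypothetical choices for illustration; the point is that each candidate order now requires one refit per forecast origin rather than a single fit, and the first `min_train` observations are never evaluated out of sample.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def rolling_origin_mse(y, order, min_train=150):
    """One-step-ahead rolling-origin evaluation: refit at every forecast origin."""
    sq_errors = []
    for t in range(min_train, len(y)):
        res = ARIMA(y[:t], order=order).fit()   # one refit per origin
        forecast = res.forecast(steps=1)[0]     # one-step-ahead prediction
        sq_errors.append((y[t] - forecast) ** 2)
    return float(np.mean(sq_errors))

# Same simulated series as in the earlier sketch.
rng = np.random.default_rng(0)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.7 * y[t - 1] + rng.normal()

# With 200 observations and min_train=150, each candidate order costs 50 refits,
# versus a single fit when ranking candidates by AIC or BIC.
for order in [(1, 0, 0), (2, 0, 1)]:
    print(order, rolling_origin_mse(y, order))
```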