Solved – How can the AIC or BIC be used instead of a train/test split

aic, bic, cross-validation, time series, train

I've recently come across several "informal" sources indicating that, in some circumstances, if we use the AIC or BIC to select a time series model we don't need to split the data into test and training sets – we can use all the data for training. (Sources include, among others, a discussion on Rob Hyndman's blog post on CV, this presentation from Stanford, and Section 4 of this text.)

In particular, they seem to indicate that the AIC or BIC can be used when the data set is too small to allow for a train/test split.

Rob Hyndman's comment, for example: "It is much more efficient to use AIC/BIC than to use test sets or CV, and it becomes essential for short time series where there is not enough data to do otherwise."

However, I can't seem to find any texts or papers that discuss this in detail.

One thing that especially puzzles me is that the AIC and BIC are asymptotically equivalent to cross-validation, which suggests that, if anything, they would be the ones to replace CV on large data sets; that seems to go against the idea of them being most useful for small data sets.
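
To make the scenario concrete, here is a minimal NumPy sketch of the two strategies on a short simulated series. Everything in it is my own illustrative choice rather than something taken from the sources above: the AR(2) data-generating process, the 30/10 split, and the AIC variant $n\log\hat{\sigma}^2 + 2(p+1)$.

```python
# Minimal sketch: select an AR order either by AIC on the full series
# or by one-step-ahead MSE on a held-out tail. All settings are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar(phi, n, burn=200):
    """Simulate an AR(p) series with unit-variance Gaussian innovations."""
    phi = np.asarray(phi)
    p = len(phi)
    x = np.zeros(n + burn)
    for t in range(p, n + burn):
        x[t] = phi @ x[t - p:t][::-1] + rng.standard_normal()
    return x[burn:]

def fit_ar(x, p):
    """Least-squares fit of an AR(p); returns coefficients (lags 1..p) and residual variance."""
    y = x[p:]
    X = np.column_stack([x[p - i:len(x) - i] for i in range(1, p + 1)])
    phi_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2_hat = np.mean((y - X @ phi_hat) ** 2)
    return phi_hat, sigma2_hat

def one_step_mse(phi_hat, series, start):
    """Average one-step-ahead squared error over series[start:], using only past values."""
    p = len(phi_hat)
    errs = [(series[t] - phi_hat @ series[t - p:t][::-1]) ** 2
            for t in range(start, len(series))]
    return np.mean(errs)

x = simulate_ar([0.6, -0.3], n=40)       # short series, true order 2
n_train = 30

for p in range(1, 6):
    _, s2_full = fit_ar(x, p)
    aic = len(x) * np.log(s2_full) + 2 * (p + 1)   # AIC computed on ALL the data
    phi_tr, _ = fit_ar(x[:n_train], p)
    mse = one_step_mse(phi_tr, x, n_train)          # the train/test alternative
    print(f"p={p}  AIC(full)={aic:7.2f}  test one-step MSE={mse:5.2f}")
```

The point of the question is whether the first column (computed from all 40 points) can stand in for the second (which sacrifices 10 points to a test set).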

Can anyone point me to a formal discussion (book chapters, papers, tutorials) of this idea?

Best Answer

In Section 5.5 of this book, the authors discuss how many of these model selection criteria arise. They start with Akaike's FPE criterion for AR models and then go on to the AIC, AICc, and BIC, walking through the derivations quite thoroughly.

What these criteria have in common is that they look at what happens when you use some observed in-sample data $\{X_t\}$ to estimate the model parameters, and then consider some loss function (mean square prediction error or KL divergence) evaluated on unobserved/hypothetical out-of-sample data $\{Y_t\}$ when the estimated model is applied to this new data. The main ideas are that (1) you take the expectation with respect to all of the data, and (2) you use some asymptotic results to get expressions for some of the expectations. The quantity from (1) gives you expected overall performance, but (2) assumes you have a lot more data than you actually do. I am no expert, but I assume that cross-validation approaches target these same measures of performance; instead of treating the out-of-sample data as hypothetical, however, they use real data that was split off from the training data.
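
In symbols (the notation here is mine, not the book's), the quantity these derivations approximate is roughly
\begin{align*}
\text{target} \;=\; E_{X}\Big[\, E_{Y}\big[\, L\big(Y, \hat{\theta}(X)\big) \,\big|\, X \,\big] \Big],
\end{align*}
where $\hat{\theta}(X)$ are the parameters estimated from the observed series $\{X_t\}$, $\{Y_t\}$ is an independent realization of the same process, and $L$ is the chosen loss (squared one-step prediction error for the FPE, KL divergence for the AIC). Step (1) is this double expectation, and step (2) supplies asymptotic approximations for the pieces that depend on the sampling distribution of $\hat{\theta}(X)$.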

The simplest example is the FPE criterion. Assume you estimate your AR($p$) model on the entire data set (which plays the role of the training set) and obtain $\{\hat{\phi}_i\}_i$. Then the expected loss on the unobserved data $\{Y_t\}$ (it is hypothetical, not split off as in cross-validation) is
\begin{align*}
& E(Y_{n+1} - \hat{\phi}_1 Y_n - \cdots - \hat{\phi}_p Y_{n+1-p})^2 \\
&= E(Y_{n+1} - \phi_1 Y_n - \cdots - \phi_p Y_{n+1-p} \\
& \hspace{30mm} - (\hat{\phi}_1 - \phi_1) Y_n - \cdots - (\hat{\phi}_p - \phi_p) Y_{n+1-p})^2 \\
&= E(Z_{n+1} - (\hat{\phi}_1 - \phi_1) Y_n - \cdots - (\hat{\phi}_p - \phi_p) Y_{n+1-p})^2 \\
&= \sigma^2 + E\big[E\big[((\hat{\phi}_1 - \phi_1) Y_n + \cdots + (\hat{\phi}_p - \phi_p) Y_{n+1-p})^2 \,\big|\, \{X_t\}\big]\big] \\
&= \sigma^2 + E\left[ \sum_{i=1}^p \sum_{j=1}^p (\hat{\phi}_i - \phi_i)(\hat{\phi}_j - \phi_j) E\left[ Y_{n+1-i} Y_{n+1-j} \mid \{X_t\} \right] \right] \\
&= \sigma^2 + E\big[(\hat{\phi}_p - \phi_p)' \Gamma_p (\hat{\phi}_p - \phi_p)\big] \\
&\approx \sigma^2 \left( 1 + \frac{p}{n} \right) \tag{typo in book: $n^{-1/2}$ should be $n^{1/2}$} \\
&\approx \frac{n \hat{\sigma}^2}{n-p} \left( 1 + \frac{p}{n} \right) = \hat{\sigma}^2 \, \frac{n+p}{n-p} \tag{$n \hat{\sigma}^2/\sigma^2$ approx. $\chi^2_{n-p}$}
\end{align*}
Here $Z_{n+1}$ is the innovation at time $n+1$; the cross term between $Z_{n+1}$ and the coefficient-error terms drops out because $Z_{n+1}$ is independent of $Y_n, \dots, Y_{n+1-p}$ and of $\{X_t\}$.
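
Here is a small Monte Carlo check of that last line (the setup is my own choice, not the book's: an AR(2) with $\phi = (0.6, -0.3)$, unit innovation variance, $n = 50$). It compares the naive in-sample variance, the FPE correction $\hat{\sigma}^2 (n+p)/(n-p)$, and the actual expected one-step error on independent data, averaged over many simulated training series.

```python
# Monte Carlo check of FPE = sigma_hat^2 * (n+p)/(n-p) as an estimate of the
# expected one-step prediction error on fresh data. Setup is illustrative only.
import numpy as np

rng = np.random.default_rng(1)
phi_true = np.array([0.6, -0.3])   # assumed AR(2) with unit innovation variance
p, n, reps = 2, 50, 2000

def simulate(n, burn=200):
    x = np.zeros(n + burn)
    for t in range(p, n + burn):
        x[t] = phi_true @ x[t - p:t][::-1] + rng.standard_normal()
    return x[burn:]

def fit(x):
    # OLS estimate of the AR(2) coefficients plus the in-sample residual variance.
    y = x[p:]
    X = np.column_stack([x[p - i:len(x) - i] for i in range(1, p + 1)])
    phi_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return phi_hat, np.mean((y - X @ phi_hat) ** 2)

in_sample, fpe, out_of_sample = [], [], []
for _ in range(reps):
    x = simulate(n)                          # the observed data {X_t}
    phi_hat, s2 = fit(x)
    in_sample.append(s2)
    fpe.append(s2 * (n + p) / (n - p))       # the last line of the derivation
    y = simulate(n)                          # an independent realization {Y_t}
    errs = [(y[t] - phi_hat @ y[t - p:t][::-1]) ** 2 for t in range(p, n)]
    out_of_sample.append(np.mean(errs))

print("mean in-sample sigma_hat^2      :", round(np.mean(in_sample), 3))
print("mean FPE sigma_hat^2 (n+p)/(n-p):", round(np.mean(fpe), 3))
print("mean out-of-sample one-step MSE :", round(np.mean(out_of_sample), 3))
print("sigma^2 (1 + p/n)               :", round(1.0 * (1 + p / n), 3))
```

With these settings the in-sample variance should average somewhat below $\sigma^2 = 1$, while the FPE and the simulated out-of-sample MSE should both land near $\sigma^2(1 + p/n)$, which is exactly what the correction factor is for.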

I don't know of any papers off the top of my head that empirically compare the performance of these criteria with cross-validation techniques, but this book does give plenty of references on how the FPE, AIC, AICc, and BIC compare with each other.