If the AIC and the BIC are asymptotically equivalent to cross-validation, is it possible to dispense with a test set when using them?

Tags: aic, bic, cross-validation

Several sources I've come across state that the AIC and the BIC are asymptotically equivalent to cross-validation (see multiple answers here and here, for example).

When training a predictive model, is this useful in practice, or is it just a theoretical consideration?

Empirically, is there a sample size above which we can dispense with using a test set and use all the data we have for training, given this asymptotic equivalence?


I've recently come across a professional demand forecasting package which does automated forecast generation. Based on the documentation, they don't split their time series into train and test sets; they simply train a number of models and then take the one with the best BIC, which seems to hint that the AIC and BIC are useful in practice, but I'm not sure.

Could the above-mentioned equivalence between BIC and CV be the reason this package works this way, or is there another reason why they dispense with a train/test split and simply use the BIC as a model selection criterion?

(note that this package is designed to handle millions of time series concurrently)

Best Answer

AIC is asymptotically equivalent to leave-one-out cross-validation (LOOCV) (Stone 1977), and BIC is asymptotically equivalent to leave-$k$-out cross-validation (LKOCV) with $k = n[1 - 1/(\log(n) - 1)]$, where $n$ is the sample size (Shao 1997). So if you are happy with LOOCV or LKOCV in terms of optimizing model prediction error and consistency of selection, respectively, then yes, you could in principle get rid of splitting the data into a training and a test set.

Note that in the context of L0-penalized GLMs (where you penalize the log-likelihood of your model by $\lambda$ times the number of nonzero coefficients, i.e. the L0-norm of the model coefficients) you can also optimize the AIC or BIC objective directly, since $\lambda = 2$ corresponds to AIC and $\lambda = \log(n)$ to BIC; this is what the l0ara R package does. To me this makes more sense than what is done, for example, for LASSO or elastic net regression in glmnet, where one objective (the LASSO or elastic net penalized likelihood) is optimized first, and the regularization parameter(s) are then tuned based on some other objective (e.g. minimizing cross-validation prediction error, AIC or BIC).
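
To make the penalty interpretation concrete, here is a minimal base-R sketch (simulated data, illustrative variable names; it uses a plain lm fit rather than the l0ara package) checking that AIC and BIC are just $-2\log L$ plus $\lambda$ times the number of estimated parameters, with $\lambda = 2$ and $\lambda = \log(n)$ respectively:

```r
## Minimal sketch (base R): AIC and BIC as a penalized log-likelihood,
## -2*logLik + lambda*k with lambda = 2 (AIC) and lambda = log(n) (BIC),
## where k is the number of estimated parameters. Simulated data,
## illustrative variable names only.
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
ll  <- as.numeric(logLik(fit))
k   <- attr(logLik(fit), "df")   # parameters counted by R: 3 coefficients + sigma

c(AIC_by_hand = -2 * ll + 2 * k,      AIC = AIC(fit))
c(BIC_by_hand = -2 * ll + log(n) * k, BIC = BIC(fit))
```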

Syed (2011) on page 10 notes "We can also try to gain an intuitive understanding of the asymptotic equivalence by noting that the AIC minimizes the Kullback-Leibler divergence between the approximate model and the true model. The Kullback-Leibler divergence is not a distance measure between distributions, but really a measure of the information loss when the approximate model is used to model the ground reality. Leave-one-out cross validation uses a maximal amount of data for training to make a prediction for one observation. That is, $n −1$ observations as stand-ins for the approximate model relative to the single observation representing “reality”. We can think of this as learning the maximal amount of information that can be gained from the data in estimating loss. Given independent and identically distributed observations, performing this over $n$ possible validation sets leads to an asymptotically unbiased estimate."
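
As a quick empirical check of this intuition, here is a small simulation sketch (base R, simulated data; the candidate models, sample size and number of replications are arbitrary illustrative choices, not taken from Syed 2011) recording how often AIC and exact LOOCV pick the same linear model; in this setting they agree in the large majority of runs:

```r
## Minimal sketch: how often do AIC and exact LOOCV select the same model?
## Simulated data; candidate models and names are illustrative only.
set.seed(42)
pick_same <- replicate(200, {
  n  <- 200
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)        # x3 is pure noise
  y  <- 1 + 2 * x1 - x2 + rnorm(n)
  fits <- list(lm(y ~ x1), lm(y ~ x1 + x2), lm(y ~ x1 + x2 + x3))
  ## analytic LOOCV error (see the hat-matrix shortcut discussed below)
  loocv <- sapply(fits, function(f) mean((residuals(f) / (1 - hatvalues(f)))^2))
  which.min(sapply(fits, AIC)) == which.min(loocv)
})
mean(pick_same)   # proportion of simulations where AIC and LOOCV agree
```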

Note that the LOOCV error of a linear model can also be calculated analytically from the residuals and the diagonal of the hat matrix, without having to actually carry out any cross-validation. This would always be an alternative to the AIC, as the AIC is only an asymptotic approximation of the LOOCV error.
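
As an illustration, here is a minimal base-R sketch (simulated data, illustrative names) that computes the exact LOOCV error of a linear model from the residuals $e_i$ and the hat-matrix diagonal $h_{ii}$, using the standard identity $e_{i,-i} = e_i/(1 - h_{ii})$, and checks it against a brute-force leave-one-out loop:

```r
## Minimal sketch (base R): exact LOOCV error for a linear model from the
## residuals and the hat-matrix diagonal, checked against an explicit loop.
## Data and variable names are illustrative only.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
d <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = d)

## Shortcut: LOOCV residual_i = residual_i / (1 - h_ii)
press_mse <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)

## Brute force: refit n times, each time leaving one observation out
loo_mse <- mean(sapply(seq_len(n), function(i) {
  f <- lm(y ~ x, data = d[-i, ])
  (d$y[i] - predict(f, newdata = d[i, , drop = FALSE]))^2
}))

c(press_mse = press_mse, loo_mse = loo_mse)   # identical up to rounding
```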

References

Stone M. (1977) An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B 39, 44–47.

Shao J. (1997) An asymptotic theory for linear model selection. Statistica Sinica 7, 221–242.