Solved – Can AIC be used on out-of-sample data in cross-validation to select one model over another

aic, cross-validation, model-selection, modeling

Following Gelman, Hwang, and Vehtari's 2014 paper "Understanding predictive information criteria for Bayesian models", I understand that cross-validation and information criteria (the Bayesian information criterion, BIC, and the Akaike information criterion, AIC) can be used separately. Usually, with a large enough sample size, one would use cross-validation with some measure of predictive accuracy to select a given model over others. With smaller sample sizes, AIC and BIC might be preferred, computed on the training data without cross-validation. My confusion is whether AIC and BIC can be used along with cross-validation; for example, can AIC and BIC be computed on the left-out fold in a 10-fold cross-validation? The idea is to have an out-of-sample criterion that rewards model fit while penalising complexity.

Best Answer

Can AIC and BIC be used on the left-out fold in a 10-fold cross-validation?

No, that would not make sense. AIC and cross-validation (CV) both offer estimates of the model's expected log-likelihood* on new, unseen data from the same population from which the current sample was drawn. They do so in two different ways.

  1. AIC measures the log-likelihood of the entire sample at once, based on parameters estimated on that same sample, and then corrects for the resulting overfitting (in-sample log-likelihood is an optimistic estimate of out-of-sample log-likelihood) via the penalty $p$ in $\text{AIC}=-2(\text{loglik}-p)$. Here $\text{loglik}$ is the log-likelihood of the sample data according to the model and $p$ is the number of the model's degrees of freedom (a measure of the model's flexibility);
  2. CV measures the log-likelihood on held-out subsamples (folds), based on parameters estimated on the corresponding training subsamples. Hence, unlike the case of AIC, there is no overfitting to correct for.** Therefore, there is no need to replace the CV estimates of out-of-sample log-likelihood with a penalized log-likelihood such as AIC; see the sketch after this list contrasting the two estimates.
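
To make the contrast concrete, here is a minimal sketch (not from the original answer) for a Gaussian linear regression on simulated data. It computes AIC from a single full-sample fit and, separately, the summed out-of-sample log-likelihood from 10-fold CV; on the deviance scale ($-2\times$ log-likelihood) the two numbers estimate the same quantity. The data, model, and helper names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Hypothetical data: Gaussian linear regression with 3 predictors.
rng = np.random.default_rng(0)
n, k = 200, 3
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=n)

def gaussian_loglik(y_true, y_pred, sigma):
    """Log-likelihood of observations under a Gaussian error model."""
    return norm.logpdf(y_true, loc=y_pred, scale=sigma).sum()

# --- AIC: fit once on the full sample, then penalize the in-sample fit ---
fit = LinearRegression().fit(X, y)
resid = y - fit.predict(X)
sigma_mle = np.sqrt(np.mean(resid ** 2))   # MLE of the error scale
loglik_full = gaussian_loglik(y, fit.predict(X), sigma_mle)
p = k + 2                                  # coefficients + intercept + sigma
aic = -2 * (loglik_full - p)

# --- CV: estimate out-of-sample log-likelihood directly; no penalty needed ---
cv_loglik = 0.0
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    f = LinearRegression().fit(X[train], y[train])
    r = y[train] - f.predict(X[train])
    s = np.sqrt(np.mean(r ** 2))
    cv_loglik += gaussian_loglik(y[test], f.predict(X[test]), s)

print(f"AIC (deviance scale):   {aic:.1f}")
print(f"-2 * CV log-likelihood: {-2 * cv_loglik:.1f}")  # comparable scale
```

The point of the sketch is that the CV number is already an out-of-sample estimate, so applying the AIC penalty to it on top would penalize for overfitting twice.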

Analogous logic holds for BIC, which penalizes the same in-sample log-likelihood via $\text{BIC} = -2\,\text{loglik} + p \ln n$, where $n$ is the sample size.

*CV can be used for other functions of the data in place of log-likelihood, too, but for comparability with AIC, I keep the discussion focused on log-likelihood.

**Actually, CV yields a slightly pessimistic estimate of the out-of-sample log-likelihood because the training subsamples are smaller than the entire sample, so the model has somewhat larger estimation variance than it would have had it been estimated on the entire sample. In leave-one-out CV the problem is negligible, as the training subsamples are almost as large as the entire sample; in K-fold CV the problem can be noticeable for small K but shrinks as K grows, as the small simulation below illustrates.
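
A small simulation sketch of this effect, under the same illustrative Gaussian regression assumptions as above: averaged over repeated datasets, the mean CV log-likelihood should be lowest (most pessimistic) for K=2 and rise toward the near-leave-one-out value as K grows.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)

def cv_loglik(X, y, n_splits):
    """Summed out-of-sample Gaussian log-likelihood over K folds."""
    total = 0.0
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        f = LinearRegression().fit(X[train], y[train])
        r = y[train] - f.predict(X[train])
        s = np.sqrt(np.mean(r ** 2))
        total += norm.logpdf(y[test], loc=f.predict(X[test]), scale=s).sum()
    return total

# Average over repeated simulated datasets to smooth out noise.
results = {K: [] for K in (2, 5, 50)}
for _ in range(200):
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
    for K in results:
        results[K].append(cv_loglik(X, y, K))

# Smaller K -> smaller training sets -> more pessimistic estimate.
for K, vals in results.items():
    print(f"K={K:2d}: mean CV log-likelihood = {np.mean(vals):.1f}")
```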