Solved – Can AIC be used on out-of-sample data in cross-validation to select one model over another

aic, cross-validation, model-selection, modeling

Following Gelman, Hwang, and Vehtari's 2014 paper "Understanding predictive information criteria for Bayesian models", I understand that cross-validation and information criteria (the Bayesian information criterion, BIC, and the Akaike information criterion, AIC) can be used separately. Usually, with a large enough sample size, one would use cross-validation with some measure of predictive accuracy to select a given model over others. With smaller sample sizes, AIC and BIC might be preferred, computed on the training data without cross-validation. My confusion is whether AIC and BIC can be used along with cross-validation; for example, can AIC and BIC be computed on the left-out fold in a 10-fold cross-validation? The idea is to have an out-of-sample criterion that rewards model fit while penalising complexity.

Best Answer

Can AIC and BIC be used on the left-out fold in a 10-fold cross-validation?

No, that would not make sense. AIC and cross-validation (CV) both offer estimates of the model's expected log-likelihood* on new, unseen data from the same population from which the current sample was drawn. They do so in two different ways.

  1. AIC measures the log-likelihood of the entire sample at once, based on parameters estimated on that same sample, and then corrects for the resulting overfitting (in-sample log-likelihood is an optimistic estimate of out-of-sample log-likelihood) via the penalty $p$ in $\text{AIC}=-2(\text{loglik}-p)$. Here $\text{loglik}$ is the log-likelihood of the sample data according to the model and $p$ is the number of the model's degrees of freedom (a measure of the model's flexibility);
  2. CV measures the log-likelihood on held-out subsamples (folds), based on parameters estimated on the corresponding training subsamples. Hence, unlike the case of AIC, there is no overfitting to correct for.** Therefore, there is no need to replace the CV estimates of out-of-sample log-likelihood with a penalized log-likelihood such as AIC; see the sketch after this list contrasting the two estimates.
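
To make the contrast concrete, here is a minimal sketch (not from the original answer) for a Gaussian linear regression on simulated data. It computes AIC from a single full-sample fit and, separately, the summed out-of-sample log-likelihood from 10-fold CV; on the deviance scale ($-2\times$ log-likelihood) the two numbers estimate the same quantity. The data, model, and helper names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Hypothetical data: Gaussian linear regression with 3 predictors.
rng = np.random.default_rng(0)
n, k = 200, 3
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=n)

def gaussian_loglik(y_true, y_pred, sigma):
    """Log-likelihood of observations under a Gaussian error model."""
    return norm.logpdf(y_true, loc=y_pred, scale=sigma).sum()

# --- AIC: fit once on the full sample, then penalize the in-sample fit ---
fit = LinearRegression().fit(X, y)
resid = y - fit.predict(X)
sigma_mle = np.sqrt(np.mean(resid ** 2))   # MLE of the error scale
loglik_full = gaussian_loglik(y, fit.predict(X), sigma_mle)
p = k + 2                                  # coefficients + intercept + sigma
aic = -2 * (loglik_full - p)

# --- CV: estimate out-of-sample log-likelihood directly; no penalty needed ---
cv_loglik = 0.0
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    f = LinearRegression().fit(X[train], y[train])
    r = y[train] - f.predict(X[train])
    s = np.sqrt(np.mean(r ** 2))
    cv_loglik += gaussian_loglik(y[test], f.predict(X[test]), s)

print(f"AIC (deviance scale):   {aic:.1f}")
print(f"-2 * CV log-likelihood: {-2 * cv_loglik:.1f}")  # comparable scale
```

The point of the sketch is that the CV number is already an out-of-sample estimate, so applying the AIC penalty to it on top would penalize for overfitting twice.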

Analogous logic holds for BIC, which penalizes the same in-sample log-likelihood via $\text{BIC} = -2\,\text{loglik} + p \ln n$, where $n$ is the sample size.

*CV can be used for other functions of the data in place of log-likelihood, too, but for comparability with AIC, I keep the discussion focused on log-likelihood.

**Actually, CV yields a slightly pessimistic estimate of the out-of-sample log-likelihood because the training subsamples are smaller than the entire sample, so the model has somewhat larger estimation variance than it would have had it been estimated on the entire sample. In leave-one-out CV the problem is negligible, as the training subsamples are almost as large as the entire sample; in K-fold CV the problem can be noticeable for small K but shrinks as K grows, as the small simulation below illustrates.
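
A small simulation sketch of this effect, under the same illustrative Gaussian regression assumptions as above: averaged over repeated datasets, the mean CV log-likelihood should be lowest (most pessimistic) for K=2 and rise toward the near-leave-one-out value as K grows.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)

def cv_loglik(X, y, n_splits):
    """Summed out-of-sample Gaussian log-likelihood over K folds."""
    total = 0.0
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        f = LinearRegression().fit(X[train], y[train])
        r = y[train] - f.predict(X[train])
        s = np.sqrt(np.mean(r ** 2))
        total += norm.logpdf(y[test], loc=f.predict(X[test]), scale=s).sum()
    return total

# Average over repeated simulated datasets to smooth out noise.
results = {K: [] for K in (2, 5, 50)}
for _ in range(200):
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
    for K in results:
        results[K].append(cv_loglik(X, y, K))

# Smaller K -> smaller training sets -> more pessimistic estimate.
for K, vals in results.items():
    print(f"K={K:2d}: mean CV log-likelihood = {np.mean(vals):.1f}")
```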