Solved – How does one empirically demonstrate in R which cross-validation methods the AIC and BIC are equivalent to

aic, bic, cross-validation, r

In a question elsewhere on this site, several answers mentioned that the AIC is equivalent to leave-one-out (LOO) cross-validation and that the BIC is equivalent to K-fold cross-validation. Is there a way to empirically demonstrate this in R, such that the techniques involved in LOO and K-fold are made clear and shown to be equivalent to the AIC and BIC values? Well-commented code would be helpful in this regard. In addition, in demonstrating the BIC, please use the lme4 package. See below for a sample dataset…

library(lme4) #for the BIC function

generate.data <- function(seed)
{
    set.seed(seed) #set a seed so the results are consistent (I hope)
    a <- rnorm(60) #predictor
    b <- rnorm(60) #predictor
    c <- rnorm(60) #predictor
    y <- rnorm(60) * 3.5 + a + b #the outcome is really a function of predictors a and b, but not of c
    data <- data.frame(y, a, b, c)
    return(data)
}

data <- generate.data(76)
good.model <- lm(y ~ a + b, data = data)
bad.model <- lm(y ~ a + b + c, data = data)
AIC(good.model)
BIC(logLik(good.model))
AIC(bad.model)
BIC(logLik(bad.model))
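
Since the question also asks about K-fold cross-validation, here is a minimal sketch of that machinery for reference (K = 10, the fold seed, and the fold assignment are arbitrary choices of mine, not part of the original setup); it illustrates the mechanics only, not an equivalence to BIC:

K <- 10
set.seed(1) #arbitrary seed, only so the fold assignment is reproducible
folds <- sample(rep(1:K, length.out = nrow(data))) #randomly assign each row to one of K folds
cv.sse <- 0
for (k in 1:K)
    {
        train <- data[folds != k, ] #fit on every fold except the k-th
        test <- data[folds == k, ] #validate on the held-out fold
        fold.model <- lm(y ~ a + b, data = train)
        cv.sse <- cv.sse + sum((test$y - predict(fold.model, newdata = test))^2)
    }
cv.sse #total squared out-of-fold prediction error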

Per earlier comments, below I have provided a list of seeds from 1 to 10000 for which AIC and BIC disagree. This was done by a simple search through the available seeds (a sketch of that scan appears below, after the disagreement code), but if someone could provide a way to generate data that tends to produce divergent answers from these two information criteria, that may be particularly informative.

notable.seeds <- read.csv("http://student.ucr.edu/~rpier001/res.csv")$seed

As an aside, I thought about ordering these seeds by the extent to which the AIC and BIC disagree, which I've tried quantifying as the sum of the absolute values of the AIC and BIC differences between the two models. For example,

AICDiff <- AIC(bad.model) - AIC(good.model)
BICDiff <- BIC(logLik(bad.model)) - BIC(logLik(good.model))
disagreement <- sum(abs(c(AICDiff, BICDiff)))

where my disagreement metric only reasonably applies to the notable cases, i.e. those where the two criteria actually point in opposite directions. For example,

are.diff <- sum(sign(c(AICDiff, BICDiff)))
notable <- are.diff == 0 & AICDiff != 0 #TRUE when the two criteria point in opposite directions
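
Putting those pieces together, here is a sketch of how the seed scan could be reconstructed (find.notable.seeds is a name I'm introducing; the logic simply combines the definitions above):

find.notable.seeds <- function(seeds)
{
    notable <- logical(length(seeds))
    for (i in seq_along(seeds))
        {
            d <- generate.data(seeds[i])
            good <- lm(y ~ a + b, data = d)
            bad <- lm(y ~ a + b + c, data = d)
            AICDiff <- AIC(bad) - AIC(good)
            BICDiff <- BIC(logLik(bad)) - BIC(logLik(good))
            notable[i] <- sign(AICDiff) != sign(BICDiff) & AICDiff != 0 #criteria disagree
        }
    seeds[notable]
}

#e.g. find.notable.seeds(1:10000) should reproduce the kind of list loaded above (slow but straightforward)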

However, in the cases where AIC and BIC disagreed, the calculated disagreement value was always the same (and is a function of sample size). Looking back at how AIC and BIC are calculated, I can see why this might be the case computationally, but I'm not sure why it would be the case conceptually. If someone could elucidate that issue as well, I'd appreciate it.
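
For what it's worth, a numeric check is easy. With the standard definitions AIC = -2*logLik + 2*k and BIC = -2*logLik + k*log(n), both differences share the -2*logLik term, so BICDiff - AICDiff = dk*(log(n) - 2), where dk is the number of extra parameters in the larger model; whenever the two criteria disagree in sign, |AICDiff| + |BICDiff| collapses to exactly that quantity, which for fixed dk depends only on the sample size:

n <- nrow(data)
dk <- length(coef(bad.model)) - length(coef(good.model)) #one extra coefficient (c)
dk * (log(n) - 2) #about 2.094 for n = 60, the same for every notable seed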

Best Answer

In an attempt to partially answer my own question, I read Wikipedia's description of leave-one-out cross-validation:

involves using a single observation from the original sample as the validation data, and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data.

In R code, I suspect that would mean something like this...

Nobs <- nrow(data) #number of observations
resid <- rep(NA, Nobs)
for (lcv in 1:Nobs)
    {
        data.loo <- data[-lcv, ] #drop the observation that will be used for validation
        loo.model <- lm(y ~ a + b, data = data.loo) #fit a model without that observation
        #compare the held-out observed value to the value predicted by the LOO model
        resid[lcv] <- data[lcv, "y"] - (coef(loo.model)[1] +
                                        coef(loo.model)[2] * data[lcv, "a"] +
                                        coef(loo.model)[3] * data[lcv, "b"])
    }

... is supposed to yield values in resid that are related to the AIC. In practice, the sum of squared residuals across the iterations of the LOO loop detailed above is a good predictor of the AIC for the notable.seeds, r^2 = .9776. But elsewhere a contributor suggested that LOO should be asymptotically equivalent to the AIC (at least for linear models), so I'm a little disappointed that the r^2 isn't closer to 1. Obviously this isn't really an answer; it's more additional code meant to encourage someone to provide a better one.
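
As a cross-check (this uses a standard linear-model identity, not anything from the original discussion), the same LOO residuals are available in closed form, without any refitting, as the ordinary residuals divided by one minus the leverage:

full.model <- lm(y ~ a + b, data = data) #fit once on the full data
resid.closed <- resid(full.model) / (1 - hatvalues(full.model)) #closed-form LOO residuals
all.equal(as.numeric(resid.closed), resid) #should be TRUE, matching the loop above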

Addendum: Since AIC and BIC for models of a fixed sample size differ only by a constant, the correlation of BIC with the squared residuals is the same as the correlation of AIC with the squared residuals, so the approach I took above appears to be fruitless.
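
A quick numeric check of that premise (note that the df attribute of logLik counts all estimated parameters, including the residual standard deviation):

k <- attr(logLik(good.model), "df") #number of estimated parameters, incl. sigma
BIC(logLik(good.model)) - AIC(good.model) #BIC minus AIC for the same model...
k * (log(nrow(data)) - 2) #...equals k * (log(n) - 2), a constant for fixed n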