If the AIC and the BIC are asymptotically equivalent to cross-validation, is it possible to dispense with a test set when using them?

Tags: aic, bic, cross-validation

Several sources I've come across state that the AIC and the BIC are asymptotically equivalent to cross-validation (see multiple answers here and here, for example).

When training a predictive model, is this useful in practice, or is it just a theoretical consideration?

Empirically, is there a sample size above which we can dispense with using a test set and use all the data we have for training, given this asymptotic equivalence?


I've recently come across a professional demand forecasting package which does automated forecast generation. Based on the documentation, they don't split their time series into train and test sets; they simply train a number of models and then take the one with the best BIC, which seems to hint that the AIC and BIC are useful in practice, but I'm not sure.

Could the above-mentioned equivalence between BIC and CV be the reason this package works this way, or is there another reason why they dispense with a train/test split and simply use the BIC as a model selection criterion?

(note that this package is designed to handle millions of time series concurrently)

Best Answer

AIC is asymptotically equivalent to leave-one-out cross-validation (LOOCV) (Stone 1977), and BIC is asymptotically equivalent to leave-$k$-out cross-validation (LKOCV) with $k = n[1 - 1/(\log(n) - 1)]$, where $n$ is the sample size (Shao 1997). So if you are happy with LOOCV or LKOCV in terms of optimizing model prediction error and consistency of selection, respectively, then yes, you could in principle get rid of splitting the data into a training and a test set.

Note that in the context of L0-penalized GLMs (where you penalize the log-likelihood of your model by $\lambda$ times the number of nonzero coefficients, i.e. the L0-norm of the model coefficients) you can also optimize the AIC or BIC objective directly, since $\lambda = 2$ corresponds to AIC and $\lambda = \log(n)$ to BIC; this is what the l0ara R package does. To me this makes more sense than what is done, for example, for LASSO or elastic net regression in glmnet, where one objective (the LASSO or elastic net penalized likelihood) is optimized first, and the regularization parameter(s) are then tuned based on some other objective (e.g. minimizing cross-validation prediction error, AIC or BIC).
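
To make the penalty interpretation concrete, here is a minimal base-R sketch (simulated data, illustrative variable names; it uses a plain lm fit rather than the l0ara package) checking that AIC and BIC are just $-2\log L$ plus $\lambda$ times the number of estimated parameters, with $\lambda = 2$ and $\lambda = \log(n)$ respectively:

```r
## Minimal sketch (base R): AIC and BIC as a penalized log-likelihood,
## -2*logLik + lambda*k with lambda = 2 (AIC) and lambda = log(n) (BIC),
## where k is the number of estimated parameters. Simulated data,
## illustrative variable names only.
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
ll  <- as.numeric(logLik(fit))
k   <- attr(logLik(fit), "df")   # parameters counted by R: 3 coefficients + sigma

c(AIC_by_hand = -2 * ll + 2 * k,      AIC = AIC(fit))
c(BIC_by_hand = -2 * ll + log(n) * k, BIC = BIC(fit))
```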

Syed (2011) on page 10 notes "We can also try to gain an intuitive understanding of the asymptotic equivalence by noting that the AIC minimizes the Kullback-Leibler divergence between the approximate model and the true model. The Kullback-Leibler divergence is not a distance measure between distributions, but really a measure of the information loss when the approximate model is used to model the ground reality. Leave-one-out cross validation uses a maximal amount of data for training to make a prediction for one observation. That is, $n −1$ observations as stand-ins for the approximate model relative to the single observation representing “reality”. We can think of this as learning the maximal amount of information that can be gained from the data in estimating loss. Given independent and identically distributed observations, performing this over $n$ possible validation sets leads to an asymptotically unbiased estimate."
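
As a quick empirical check of this intuition, here is a small simulation sketch (base R, simulated data; the candidate models, sample size and number of replications are arbitrary illustrative choices, not taken from Syed 2011) recording how often AIC and exact LOOCV pick the same linear model; in this setting they agree in the large majority of runs:

```r
## Minimal sketch: how often do AIC and exact LOOCV select the same model?
## Simulated data; candidate models and names are illustrative only.
set.seed(42)
pick_same <- replicate(200, {
  n  <- 200
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)        # x3 is pure noise
  y  <- 1 + 2 * x1 - x2 + rnorm(n)
  fits <- list(lm(y ~ x1), lm(y ~ x1 + x2), lm(y ~ x1 + x2 + x3))
  ## analytic LOOCV error (see the hat-matrix shortcut discussed below)
  loocv <- sapply(fits, function(f) mean((residuals(f) / (1 - hatvalues(f)))^2))
  which.min(sapply(fits, AIC)) == which.min(loocv)
})
mean(pick_same)   # proportion of simulations where AIC and LOOCV agree
```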

Note that the LOOCV error of a linear model can also be calculated analytically from the residuals and the diagonal of the hat matrix, without having to actually carry out any cross-validation. This would always be an alternative to the AIC, as the AIC is only an asymptotic approximation of the LOOCV error.
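
As an illustration, here is a minimal base-R sketch (simulated data, illustrative names) that computes the exact LOOCV error of a linear model from the residuals $e_i$ and the hat-matrix diagonal $h_{ii}$, using the standard identity $e_{i,-i} = e_i/(1 - h_{ii})$, and checks it against a brute-force leave-one-out loop:

```r
## Minimal sketch (base R): exact LOOCV error for a linear model from the
## residuals and the hat-matrix diagonal, checked against an explicit loop.
## Data and variable names are illustrative only.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
d <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = d)

## Shortcut: LOOCV residual_i = residual_i / (1 - h_ii)
press_mse <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)

## Brute force: refit n times, each time leaving one observation out
loo_mse <- mean(sapply(seq_len(n), function(i) {
  f <- lm(y ~ x, data = d[-i, ])
  (d$y[i] - predict(f, newdata = d[i, , drop = FALSE]))^2
}))

c(press_mse = press_mse, loo_mse = loo_mse)   # identical up to rounding
```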

References

Stone M. (1977) An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B 39, 44–47.

Shao J. (1997) An asymptotic theory for linear model selection. Statistica Sinica 7, 221–242.