Solved: AIC, BIC, CIC, DIC, EIC, FIC, GIC, HIC, IIC. Can I use them interchangeably?

Tags: aic, bic, forecasting, model-selection

On p. 34 of his Pattern Recognition and Neural Networks (PRNN), Brian Ripley comments that "The AIC was named by Akaike (1974) as 'An Information Criterion' although it seems commonly believed that the A stands for Akaike". Indeed, when introducing the AIC statistic, Akaike (1974, p. 719) explains that

"IC stands for information criterion and A is added so that similar statistics, BIC, DIC
etc may follow".

Considering this quotation as a prediction made in 1974, it is interesting to note
that in just four years two types of the BIC statistic (Bayesian IC) were proposed by Akaike (1977, 1978) and Schwarz (1978). It took Spiegelhalter et al. (2002) much
longer to come up with DIC (Deviance IC). While the appearance of the CIC criterion
was not predicted by Akaike (1974), it would be naive to believe that it was never
contemplated. It was proposed by Carlos C. Rodriguez in 2005. (Note that R. Tibshirani
and K. Knight's CIC (Covariance Inflation Criterion) is a different thing.)

I knew that the EIC (Empirical IC) was proposed by researchers at Monash University around 2003.
I've just discovered the Focused Information Criterion (FIC). Some books refer to the Hannan and Quinn IC as HIC (see, e.g., this one). I know there should be a GIC (Generalised IC), and I've just discovered the Information Investing Criterion (IIC). There are also NIC, TIC, and more.

I think I could possibly cover the rest of the alphabet, so I am not asking where the sequence AIC, BIC, CIC, DIC, EIC, FIC, GIC, HIC, IIC, … stops, or which letters of the alphabet have not been used or have been used at least twice (e.g. the E in EIC can stand for either Extended or Empirical). My question is simpler and, I hope, more practically useful: can I use those statistics interchangeably, ignoring the specific assumptions they were derived under, the specific situations they were meant to be applicable in, and so on?

This question is partly motivated by Burnham & Anderson (2001) writing that:

...the comparison of AIC and BIC model selection ought to be based on their performance 
properties such as mean square error for parameter estimation (includes prediction) and 
confidence interval coverage: tapering effects or not, goodness-of-fit issues, 
derivation of theory is irrelevant as it can be frequentist or Bayes. 

Chapter 7 of the Hyndman et al. monograph on exponential smoothing appears to follow
the Burnham and Anderson advice: it examines how well five alternative ICs (AIC, BIC, AICc, HQIC, LEIC) perform at selecting the model that forecasts best (as measured by a newly proposed error measure called MASE), and it concludes that the AIC was the better alternative more often than not. (The HQIC was reported as the best model selector just once.)
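
For concreteness, here is a minimal sketch (my own, not taken from the monograph) of how four of those five criteria are computed from the same maximised log-likelihood; they differ only in the penalty attached to the number of estimated parameters k. LEIC is omitted because, as I understand it, its penalty is calibrated empirically rather than given by a closed-form formula. The inputs at the end are made up purely for illustration.

```python
# Closed-form penalised-likelihood criteria, computed from the same maximised
# log-likelihood. k is the number of estimated parameters, n the sample size.
import numpy as np

def information_criteria(loglik, k, n):
    return {
        "AIC":  -2 * loglik + 2 * k,
        "AICc": -2 * loglik + 2 * k * n / (n - k - 1),    # small-sample corrected AIC
        "BIC":  -2 * loglik + k * np.log(n),
        "HQIC": -2 * loglik + 2 * k * np.log(np.log(n)),  # Hannan-Quinn
    }

# Hypothetical numbers, purely for illustration.
print(information_criteria(loglik=-120.0, k=4, n=60))
```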

I am not sure what useful purpose is served by research exercises that implicitly treat all
ICs as though they were derived to answer one and the same question under equivalent sets of assumptions. In particular, I am not sure how useful it is to investigate the predictive performance of a criterion that Hannan and Quinn derived as a consistent estimator of the order of an autoregression for ergodic stationary sequences by applying it to the non-stationary exponential smoothing models described and analysed in the Hyndman et al. monograph. Am I missing something here?

References:

Akaike, H. (1974), A new look at the statistical model identification, IEEE Transactions on Automatic Control 19(6), 716-723.

Akaike, H. (1977), On entropy maximization principle, in P. R. Krishnaiah, ed., Applications of Statistics, Vol. 27, Amsterdam: North Holland, pp. 27-41.

Akaike, H. (1978), A Bayesian analysis of the minimum AIC procedure, Annals of the Institute of Statistical Mathematics 30(1), 9-14.

Burnham, K. P. & Anderson, D. R. (2001), Kullback–Leibler information as a basis for strong inference in ecological studies, Wildlife Research 28, 111-119.

Hyndman, R. J., Koehler, A. B., Ord, J. K. & Snyder, R. D. (2008), Forecasting with Exponential Smoothing: The State Space Approach, New York: Springer.

Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

Schwarz, G. (1978), Estimating the dimension of a model, Annals of Statistics 6(2), 461-464.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P. & van der Linde, A. (2002), Bayesian measures of model complexity and fit (with discussion), Journal of the Royal Statistical Society, Series B (Statistical Methodology) 64(4), 583-639.

Best Answer

My understanding is that AIC, DIC, and WAIC are all estimating the same thing: the expected out-of-sample deviance associated with a model. This is also the same thing that cross-validation estimates. In Gelman et al. (2013), they say this explicitly:

A natural way to estimate out-of-sample prediction error is cross-validation (see Vehtari and Lampinen, 2002, for a Bayesian perspective), but researchers have always sought alternative measures, as cross-validation requires repeated model fits and can run into trouble with sparse data. For practical reasons alone, there remains a place for simple bias corrections such as AIC (Akaike, 1973), DIC (Spiegelhalter, Best, Carlin, and van der Linde, 2002, van der Linde, 2005), and, more recently, WAIC (Watanabe, 2010), and all these can be viewed as approximations to different versions of cross-validation (Stone, 1977).
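
To illustrate what "estimating the same thing" means in practice, here is a hedged sketch (my own, not from Gelman et al.) for the simplest possible case: an ordinary Gaussian linear regression, where the AIC and a brute-force leave-one-out estimate of out-of-sample deviance come out close to each other. The simulated data, dimensions, and helper function are hypothetical and exist only to make the script self-contained.

```python
# Compare AIC with a leave-one-out cross-validation estimate of
# out-of-sample deviance for a simulated Gaussian linear regression.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=1.5, size=n)

def fit(X_tr, y_tr):
    """OLS coefficients and the maximum-likelihood error variance."""
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    resid = y_tr - X_tr @ beta
    return beta, np.mean(resid ** 2)          # ML estimate divides by n

# In-sample deviance plus the AIC penalty (k = p coefficients + 1 variance).
beta_hat, sigma2_hat = fit(X, y)
loglik = norm.logpdf(y, loc=X @ beta_hat, scale=np.sqrt(sigma2_hat)).sum()
k = p + 1
aic = -2 * loglik + 2 * k

# Leave-one-out estimate of the same quantity: refit without point i,
# score the held-out point, and convert to the deviance scale.
loo_lpd = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b_i, s2_i = fit(X[keep], y[keep])
    loo_lpd += norm.logpdf(y[i], loc=X[i] @ b_i, scale=np.sqrt(s2_i))
loo_deviance = -2 * loo_lpd

print(f"AIC estimate of out-of-sample deviance: {aic:.1f}")
print(f"LOO-CV estimate:                        {loo_deviance:.1f}")
```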

BIC estimates something different, which is related to minimum description length. Gelman et al. say:

BIC and its variants differ from the other information criteria considered here in being motivated not by an estimation of predictive fit but by the goal of approximating the marginal probability density of the data, p(y), under the model, which can be used to estimate relative posterior probabilities in a setting of discrete model comparison.
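
To see why BIC targets something different, recall the standard Laplace-approximation argument behind Schwarz's derivation (a textbook sketch under the usual regularity conditions, not a quotation from Gelman et al.): for a model with k parameters fitted to n observations,

```latex
\log p(y) \;=\; \log \int p(y \mid \theta)\, p(\theta)\, d\theta
\;\approx\; \log p(y \mid \hat{\theta}) - \frac{k}{2} \log n + O(1)
```

so the model with the smallest BIC = -2 log p(y | theta_hat) + k log n is approximately the model with the largest marginal likelihood p(y), whereas AIC, DIC, and WAIC aim at expected predictive accuracy.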

I don't know anything about the other information criteria you listed, unfortunately.

Can you use the AIC-like information criteria interchangeably? Opinions may differ, but given that AIC, DIC, WAIC, and cross-validation all estimate the same thing, then yes, they're more-or-less interchangeable. BIC is different, as noted above. I don't know about the others.

Why have more than one?

  • AIC works well when you have a maximum likelihood estimate and flat priors, but doesn't really have anything to say about other scenarios. The penalty is also too small when the number of parameters approaches the number of data points. AICc over-corrects for this, which can be good or bad depending on your perspective (see the sketch after this list).

  • DIC uses a smaller penalty if parts of the model are heavily constrained by priors (e.g. in some multi-level models where variance components are estimated). This is good, since heavily constrained parameters don't really constitute a full degree of freedom. Unfortunately, the formulas usually used for DIC assume that the posterior is essentially Gaussian (i.e. that it is well-described by its mean), and so one can get strange results (e.g. negative penalties) in some situations.

  • WAIC uses the whole posterior density more effectively than DIC does, so Gelman et al. prefer it although it can be a pain to calculate in some cases.

  • Cross-validation does not rely on any particular formula, but it can be computationally prohibitive for many models.
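
The sketch promised in the first bullet (my own illustration, using the standard AICc correction formula): with n held fixed, the AIC penalty grows only linearly in k, while the AICc correction 2k(k + 1)/(n - k - 1) blows up as k approaches n. The value n = 30 below is arbitrary, chosen only to make the divergence visible.

```python
# AIC vs AICc penalties as the parameter count k approaches the sample size n.
n = 30
for k in (2, 10, 20, 25, 28):
    aic_pen = 2 * k
    aicc_pen = 2 * k + 2 * k * (k + 1) / (n - k - 1)
    print(f"k={k:2d}  AIC penalty={aic_pen:3d}  AICc penalty={aicc_pen:7.1f}")
```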

In my view the decision about which of the AIC-like criteria to use depends entirely on these sorts of practical issues, rather than on a mathematical proof that one will do better than another.

References:

Gelman, A., Hwang, J. & Vehtari, A. (2013), Understanding predictive information criteria for Bayesian models. Available from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.295.3501&rep=rep1&type=pdf