Model Selection – Understanding the Rule of Thumb for Meaningful Differences in AIC

aic, model-selection, rule-of-thumb

I've been struggling to find meaningful guidelines for comparing models based on differences in AIC, and I keep coming back to the rule of thumb offered by Burnham & Anderson (2004, pp. 270-272):

Some simple rules of thumb are often useful in assessing the relative merits of models in the set: Models having $\Delta_i \le 2$ have substantial support (evidence), those in which $4 \le \Delta_i \le 7$ have considerably less support, and models having $\Delta_i > 10$ have essentially no support.

I'm trying to understand the rationale behind these threshold numbers of 2, 4, 7, and 10. The Wikipedia article on the Akaike information criterion gives some guidance on this question:

Suppose that there are $R$ candidate models. Denote the AIC values of those models by $\mathrm{AIC}_1, \mathrm{AIC}_2, \mathrm{AIC}_3, \ldots, \mathrm{AIC}_R$. Let $\mathrm{AIC}_{\min}$ be the minimum of those values. Then the quantity $\exp((\mathrm{AIC}_{\min} - \mathrm{AIC}_i)/2)$ can be interpreted as being proportional to the probability that the $i$th model minimizes the (estimated) information loss.

As an example, suppose that there are three candidate models, whose AIC values are 100, 102, and 110. Then the second model is $\exp((100 - 102)/2) = 0.368$ times as probable as the first model to minimize the information loss. Similarly, the third model is $\exp((100 - 110)/2) = 0.007$ times as probable as the first model to minimize the information loss.
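
For concreteness, here is a quick Python check of those numbers (a minimal sketch; the AIC values are the ones from the quoted example):

```python
import numpy as np

# AIC values of the three candidate models from the Wikipedia example.
aics = np.array([100.0, 102.0, 110.0])

# Relative likelihood of each model vs. the AIC-minimizing model:
# exp((AIC_min - AIC_i) / 2), proportional to the probability that
# model i minimizes the estimated information loss.
rel_likelihood = np.exp((aics.min() - aics) / 2)

for aic, rl in zip(aics, rel_likelihood):
    print(f"AIC = {aic:5.1f} -> {rl:.3f} times as probable as the best model")
# AIC = 100.0 -> 1.000 times as probable as the best model
# AIC = 102.0 -> 0.368 times as probable as the best model
# AIC = 110.0 -> 0.007 times as probable as the best model
```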

So, based on this information, I compiled the following table:

Delta AIC probabilities:

| delta_aic | prob(min vs. i) | prob(i vs. min) |
|-----------|-----------------|-----------------|
| 0         | 1.000           | 1.0             |
| 2         | 0.368           | 2.7             |
| 4         | 0.135           | 7.4             |
| 7         | 0.030           | 33.1            |
| 10        | 0.007           | 148.4           |

  • delta_aic: AIC of the $i$th model under consideration minus the minimum AIC of all the models being compared. In the Wikipedia example, delta_aic is 0 for the model where AIC=100, 2 for the model where AIC=102, and 10 for the model where AIC=110.
  • prob(min vs. i): $\exp(-\text{delta\_aic}/2)$, i.e., model $i$ is x times as probable as the minimum-AIC model to minimize the information loss. In the Wikipedia example, the model where AIC=100 is equally probable, the model where AIC=102 is 0.368 times as probable, and the model where AIC=110 is 0.007 times as probable.
  • prob(i vs. min): simply 1/prob(min vs. i), i.e., the minimum-AIC model is x times as probable as model $i$ to minimize the information loss. In the Wikipedia example, the minimal model is 2.7 times as probable as the model where AIC=102, and 148.4 times as probable as the model where AIC=110.

OK, so I think I get the mathematics somewhat, but I really don't understand what all of this means for practically selecting one model over another. The rule of thumb says that if the minimal model is up to 2.7 times as probable as another model (that is, $\Delta_i \le 2$), then the two models are practically equivalent. But why is being 2.7 times as probable to minimize the information loss of so little consequence? What does that even mean? Similarly, the rule of thumb says that only when the minimal model is more than 148.4 times as probable as another model (that is, $\Delta_i > 10$) can that model no longer be considered equivalent in any meaningful way. Isn't that an extremely large tolerance?
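
For what it's worth, one way to put these evidence ratios on a familiar 0-to-1 scale is to normalize them into Akaike weights (another standard device in Burnham & Anderson); a minimal Python sketch:

```python
import numpy as np

def akaike_weights(aics):
    """Normalize the relative likelihoods exp(-delta/2) to sum to 1.

    Each weight can be read as the probability that the corresponding
    model is the Kullback-Leibler best model in the candidate set.
    """
    deltas = np.asarray(aics, dtype=float) - np.min(aics)
    rel = np.exp(-deltas / 2)
    return rel / rel.sum()

# delta_aic = 2: a 2.7:1 evidence ratio is only about a 73/27 split.
print(akaike_weights([100, 102]))   # approx. [0.731, 0.269]

# delta_aic = 10: a 148.4:1 ratio leaves under 1% for the weaker model.
print(akaike_weights([100, 110]))   # approx. [0.993, 0.007]
```

Seen as weights, $\Delta_i = 2$ amounts to roughly a 73/27 split between two models, while $\Delta_i = 10$ is roughly 99/1.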

Even when I break it down mathematically this way, the rule of thumb still does not make intuitive sense to me. So, this brings me to my questions:

  • Could someone please explain a simple logical justification for the commonly accepted rule of thumb for acceptable AIC differences?
  • Alternatively, could someone please provide alternative rational criteria that are better justified than this rule of thumb?

Best Answer

I encountered the same issue and tried to find an answer in the related literature. The Burnham & Anderson (2002) book (Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Second Edition) actually uses three approaches to derive these empirical numbers. As stated in Section 4.5 of the book, the approach that is easiest to understand is to treat $\Delta_p = \mathrm{AIC}_{\text{best}} - \mathrm{AIC}_{\min}$ (the $\Delta$ value of the expected Kullback-Leibler best model) as a random variable with a sampling distribution. They ran Monte Carlo simulation studies on this variable and found that its sampling distribution is quite stable: as long as observations are independent, sample sizes are large, and the models are nested, the 95th percentile of the sampling distribution of $\Delta_p$ is generally much less than 10, and in fact generally less than 7 (often closer to 4 in simple situations).
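
To illustrate the flavor of such a simulation (a toy sketch of my own, not the book's actual design), here is a hypothetical nested OLS candidate set in which the smallest model is the true, Kullback-Leibler best model. We repeatedly refit all candidates and record $\Delta_p$ for that true model:

```python
import numpy as np

rng = np.random.default_rng(0)

def aic_ols(y, X):
    """Gaussian AIC for an OLS fit: -2 log L + 2k, with sigma^2 profiled out."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n          # MLE of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = X.shape[1] + 1                  # regression coefficients + sigma^2
    return -2 * loglik + 2 * k

n, n_sims = 100, 5000
deltas = np.empty(n_sims)
for s in range(n_sims):
    x = rng.normal(size=(n, 3))
    y = 1.0 + 2.0 * x[:, 0] + rng.normal(size=n)   # only x1 matters
    ones = np.ones((n, 1))
    # Nested candidate set; the first design is the true (K-L best) model.
    designs = [np.hstack([ones, x[:, :1]]),
               np.hstack([ones, x[:, :2]]),
               np.hstack([ones, x])]
    aics = np.array([aic_ols(y, X) for X in designs])
    deltas[s] = aics[0] - aics.min()    # Delta_p for the K-L best model

print("95th percentile of Delta_p:", np.percentile(deltas, 95))
```

In runs like this the 95th percentile of $\Delta_p$ comes out in the low single digits, consistent with their finding that it sits well below the 7 and 10 thresholds.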

A model with $\Delta_i > 10$ therefore lies far beyond the 95th percentile of this sampling distribution, and is thus highly unlikely to be the Kullback-Leibler best model.

In addition, they actually argued against using 2 as a cutoff in a later paper (Burnham, Anderson, & Huyvaert 2011), where they say that models with $\Delta_i$ in the 2-7 range have some support and should rarely be dismissed.