Solved – How to compare models on the basis of AIC

aic, model selection

We have two models that use the same method to calculate log likelihood and the AIC for one is lower than the other. However, the one with the lower AIC is far more difficult to interpret.

We are having trouble deciding whether the added difficulty is worthwhile, and we judged this using the percentage difference in AIC. We found that the difference between the two AICs was only 0.7%, with the more complicated model having the lower AIC.

  1. Is the low percentage difference between the two a good reason to avoid using the model with the lower AIC?

  2. Does the percentage difference mean that 0.7% more information is lost in the less complicated model?

  3. Can the two models have very different results?

Best Answer

One does not compare the absolute values of two AICs (which can be of order $\sim 100$ but also $\sim 1000000$), but considers their difference: $$\Delta_i=AIC_i-AIC_{\rm min},$$ where $AIC_i$ is the AIC of the $i$-th model, and $AIC_{\rm min}$ is the lowest AIC obtained among the set of models examined (i.e., that of the preferred model). The rule of thumb, outlined e.g. in Burnham & Anderson 2004, is:

  1. if $\Delta_i<2$, then there is substantial support for the $i$-th model (or the evidence against it is worth only a bare mention), and the proposition that it is a proper description is highly probable;
  2. if $2<\Delta_i<4$, then there is strong support for the $i$-th model;
  3. if $4<\Delta_i<7$, then there is considerably less support for the $i$-th model;
  4. models with $\Delta_i>10$ have essentially no support.
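The thresholds above can be written down as a small lookup. This is only a sketch of the list as quoted: the function name `support_label` and the returned strings are illustrative choices, and values the list does not cover (the exact boundaries and the range between 7 and 10) are reported as such rather than guessed.

```python
def support_label(delta):
    """Map an AIC difference to the rule-of-thumb wording quoted above."""
    if delta < 0:
        # Delta is measured from the minimum AIC, so it cannot be negative.
        raise ValueError("delta must be non-negative")
    if delta < 2:
        return "substantial support"
    if 2 < delta < 4:
        return "strong support"
    if 4 < delta < 7:
        return "considerably less support"
    if delta > 10:
        return "essentially no support"
    # Boundary values and the 7-10 range are not covered by the list as quoted.
    return "not covered by the quoted rule of thumb"


for d in (0.7, 3.0, 5.5, 8.0, 700.0):
    print(d, "->", support_label(d))
```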

Now, regarding the 0.7% mentioned in the question, consider two situations:

  1. $AIC_1=AIC_{\rm min}=100$ and $AIC_2$ is bigger by 0.7%: $AIC_2=100.7$. Then $\Delta_2=0.7<2$ so there is no substantial difference between the models.
  2. $AIC_1=AIC_{\rm min}=100000$ and $AIC_2$ is bigger by 0.7%: $AIC_2=100700$. Then $\Delta_2=700\gg 10$ so there is no support for the second model.

Hence, saying that the difference between AICs is 0.7% does not provide any information.
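The two situations above can be checked numerically. The following is a minimal sketch using only the AIC values from the example; the helper name `delta_and_percent` is made up for illustration.

```python
def delta_and_percent(aic_min, aic_other):
    """Return the AIC difference and the same difference as a percentage of aic_min."""
    delta = aic_other - aic_min
    percent = 100.0 * delta / aic_min
    return delta, percent


# The two situations from the text: same percentage, very different deltas.
for aic_min, aic_other in [(100.0, 100.7), (100000.0, 100700.0)]:
    delta, percent = delta_and_percent(aic_min, aic_other)
    print(f"AIC_min={aic_min:g}, AIC_2={aic_other:g}: "
          f"delta={delta:.1f}, percent={percent:.1f}%")
```

Both cases report the same percentage, while the difference that the rule of thumb actually uses changes by a factor of 1000.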

The AIC value contains scaling constants coming from the log-likelihood $\mathcal{L}$; these cancel in the difference, so the $\Delta_i$ are free of such constants. One might consider $\Delta_i = AIC_i - AIC_{\rm min}$ a rescaling transformation that forces the best model to have $AIC_{\rm min} := 0$.

The formulation of AIC penalizes the use of an excessive number of parameters, and hence discourages overfitting. It prefers models with fewer parameters, as long as the others do not provide a substantially better fit. AIC tries to select the model (among those examined) that most adequately describes reality (in the form of the data under examination). This means that the question of whether any of the models is a true description of the data is never considered. Note that AIC tells you which model describes the data better; it does not give any interpretation.

Personally, I would say that if you have a simple model and a complicated one that has a much lower AIC, then the simple model is not good enough. If the more complex model is really much more complicated but the $\Delta_i$ is not huge (maybe $\Delta_i<2$, maybe $\Delta_i<5$; it depends on the particular situation), I would stick to the simpler model if it's really easier to work with.

Further, you can ascribe a probability to the $i$-th model via

$$p_i=\exp\left(\frac{-\Delta_i}{2}\right),$$

which gives a relative (compared to the model with $AIC_{\rm min}$) probability that the $i$-th model minimizes the AIC. For example, $\Delta_i=1.5$ corresponds to $p_i=0.47$ (quite high), and $\Delta_i=15$ corresponds to $p_i=0.0005$ (quite low). The first case means that there is a 47% probability that the $i$-th model might in fact be a better description than the model that yielded $AIC_{\rm min}$, and in the second case this probability is only 0.05%.
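As a quick numerical check of the formula for $p_i$, here is a minimal sketch; the function name `relative_likelihood` is an illustrative choice, not something from the question or from a library.

```python
import math


def relative_likelihood(delta):
    """Evaluate p_i = exp(-delta_i / 2) for a given AIC difference."""
    return math.exp(-delta / 2.0)


# delta = 0 is the best model itself; 1.5 and 15 are the examples from the text.
for delta in (0.0, 1.5, 15.0):
    print(f"delta={delta:4.1f} -> p={relative_likelihood(delta):.4f}")
```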

Finally, regarding the formula for AIC:

$$AIC=2k-2\mathcal{L},$$

with $k$ the number of estimated parameters and $\mathcal{L}$ the maximized log-likelihood, it is important to note that when two models with similar $\mathcal{L}$ are considered, $\Delta_i$ depends solely on the number of parameters through the $2k$ term. Hence, when $\frac{\Delta_i}{2\Delta k} < 1$, the relative improvement is due to an actual improvement of the fit, not only to an increase in the number of parameters.
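To make the role of the $2k$ term concrete, here is a small sketch of the formula. The two models and their numbers (`simple` with 3 parameters, `complex` with 5, and both log-likelihood values) are hypothetical, chosen only so that the log-likelihoods are similar.

```python
def aic(k, log_likelihood):
    """AIC = 2k - 2L, with k parameters and log-likelihood L."""
    return 2 * k - 2 * log_likelihood


# Hypothetical models: (number of parameters, log-likelihood).
models = {"simple": (3, -50.0), "complex": (5, -49.5)}

aics = {name: aic(k, ll) for name, (k, ll) in models.items()}
aic_min = min(aics.values())
deltas = {name: value - aic_min for name, value in aics.items()}

for name in models:
    print(f"{name}: AIC={aics[name]:.1f}, delta={deltas[name]:.1f}")
```

In this made-up case the extra two parameters cost 4 units of AIC but buy only 1 unit through the fit, so the simpler model ends up with the lower AIC.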

TL;DR

  1. It's a bad reason; use the absolute difference between the AICs, $\Delta_i$, rather than the percentage difference.
  2. The percentage says nothing.
  3. It is not possible to answer this question without information on the models, the data, and what "different results" means.