Solved – AIC model comparison, null model, p-value

Tags: aic, model comparison, statistical significance

I apologize in advance, as I feel this is a question that has been asked time and time again; however, it never seems to get a definitive answer, or at least not one I understand.

Having been "raised" with blind hypothesis testing, I struggle with the notion of model comparison, in particular with AIC:

Say I have a full model, for example the effects of variables A, B and C on response y.
I calculate the AIC of each nested model, rank the models from lowest to highest AIC, and use the rule of thumb of delta AIC < 2 to select the models of interest.
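For concreteness, here is a minimal sketch of that workflow in Python, assuming a hypothetical pandas DataFrame `df` with columns y, A, B and C (all names are illustrative) and using statsmodels for the fits:

```python
import itertools

import pandas as pd
import statsmodels.formula.api as smf

predictors = ["A", "B", "C"]
candidates = []
for k in range(len(predictors) + 1):
    for subset in itertools.combinations(predictors, k):
        # "y ~ 1" is the intercept-only (null) model
        formula = "y ~ " + (" + ".join(subset) if subset else "1")
        fit = smf.ols(formula, data=df).fit()
        candidates.append({"model": formula, "AIC": fit.aic})

table = pd.DataFrame(candidates).sort_values("AIC").reset_index(drop=True)
table["delta_AIC"] = table["AIC"] - table["AIC"].min()
print(table)  # models with delta_AIC < 2 are the usual "models of interest"
```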

What do I conclude if the models of interest do not include the null model, yet when I examine these models with hypothesis testing, none of the variables is remotely significant (alpha = 5%)?

In my understanding, the best models (delta AIC < 2 in this case) are more parsimonious and fit better than the null model, which should mean that the variable(s) they contain explain the data significantly. Otherwise, wouldn't the null model be the best in terms of AIC, since it is the most parsimonious?

Maybe I understand it all wrong, but the issue is that my teachers are not really trained in statistics and mostly use AICs because they are popular in our field at the moment…

Thank you

Best Answer

For nested models, the AIC bears a very close relationship to likelihood-ratio testing. For nested models that differ by only a single fitted parameter, the relationship is exact. Putting aside the delta-AIC < 2 rule of thumb for now, if you simply choose between two models based on which has the lower AIC, then this is equivalent to basing the choice on a $\chi^2$ test p-value of 0.157. (Note that this is less stringent than a classic p < 0.05 cutoff.) For any given difference in the number of fitted parameters between nested models, you can use the AIC definition to convert the AIC difference into a corresponding log-likelihood difference for a $\chi^2$ test. So with nested models one should expect some relationship between delta-AIC values and the statistics underlying standard significance tests.
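To make that correspondence concrete, here is a small sketch (the function name is illustrative; scipy supplies the $\chi^2$ survival function) that turns an AIC comparison between nested models into the implied likelihood-ratio p-value:

```python
from scipy.stats import chi2

def aic_diff_to_lr_pvalue(aic_small, aic_big, df_diff):
    """p-value of the likelihood-ratio test implied by an AIC comparison.

    aic_small: AIC of the model with fewer parameters (nested in the bigger one)
    aic_big:   AIC of the model with more parameters
    df_diff:   difference in the number of fitted parameters
    """
    # AIC = 2k - 2*logLik, so the LR statistic 2*(logLik_big - logLik_small)
    # equals 2*df_diff - (AIC_big - AIC_small).
    lr_stat = 2 * df_diff - (aic_big - aic_small)
    return chi2.sf(lr_stat, df_diff)

# Preferring the bigger model whenever its AIC is merely lower (AIC difference of 0)
# with a 1-parameter difference corresponds to p ~= 0.157:
print(aic_diff_to_lr_pvalue(100.0, 100.0, 1))
```

With a one-parameter difference and equal AICs, the implied p-value is about 0.157, matching the cutoff mentioned above.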

It's not completely clear from your question, however, whether the models being evaluated are actually nested. With 3 predictors A, B, C you could in principle evaluate the null model, each predictor individually, the combinations A+B, A+C, B+C, and the full A+B+C model. If you are evaluating all those possibilities then the models can't all be nested one inside the next.

The parsimony/AIC issue is more complicated. A model close to the model with the minimum AIC could have either more or fewer fitted parameters than the minimum-AIC model. Under the interpretation of a delta-AIC value as estimating the probability of information loss with a different model, models with the same delta-AIC from the minimum model carry the same strength of evidence. So choosing the most parsimonious model within a delta-AIC of 2 would be imposing an additional restriction that isn't necessarily in keeping with the information-theoretic basis of the AIC. For example, if you were to weight models by their AIC (as in Akaike weighting), there would be no reason to prefer a more-parsimonious over a less-parsimonious model having the same delta-AIC.
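As an illustration, here is a minimal sketch of AIC-based model weighting (Akaike weights) with made-up AIC values: two models sharing the same delta-AIC get identical weights, regardless of how many parameters each has.

```python
import numpy as np

def akaike_weights(aics):
    aics = np.asarray(aics, dtype=float)
    delta = aics - aics.min()           # delta-AIC relative to the best model
    rel_lik = np.exp(-0.5 * delta)      # relative likelihood of each model
    return rel_lik / rel_lik.sum()      # normalized Akaike weights

# Hypothetical AICs: the second and third models may differ in parsimony,
# but they share delta-AIC = 1.5 and therefore receive the same weight.
print(akaike_weights([210.0, 211.5, 211.5, 215.0]))
```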

The Wikipedia page on AIC provides one clue to what might be going on with your observations:

Note that AIC tells nothing about the absolute quality of a model, only the quality relative to other models. Thus, if all the candidate models fit poorly, AIC will not give any warning of that. Hence, after selecting a model via AIC, it is usually good practice to validate the absolute quality of the model. Such validation commonly includes checks of the model's residuals (to determine whether the residuals seem to be random) and tests of the model's predictions.

That said, you need to be careful in interpreting p-values for individual coefficients in a multiple-regression model. In some circumstances (e.g., highly correlated predictors) one might have a model that is significantly different from the null overall but for which no individual regression coefficient passes the standard p < 0.05 cutoff. And as noted in a comment, if any predictor is categorical with more than 2 levels, you have to be very careful about just looking at the p-values of individual coefficients.
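A quick simulation can illustrate the collinearity point; the data-generating settings below are invented purely for illustration and use statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 60
A = rng.normal(size=n)
B = A + rng.normal(scale=0.05, size=n)    # B is nearly identical to A
y = A + B + rng.normal(scale=2.0, size=n)

X = sm.add_constant(np.column_stack([A, B]))
fit = sm.OLS(y, X).fit()
print(fit.f_pvalue)      # overall model vs. null: typically very small
print(fit.pvalues[1:])   # individual coefficients: often both above 0.05
```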

So one should certainly make sure that a model developed this way is better than a null model, whether in terms of standard significance tests or in the ability to make useful predictions on new data. The p values for the individual predictors, however, should be of less concern.
