Solved – AIC model comparison, null model, p-value

Tags: aic, model comparison, statistical significance

I apologize in advance, as I feel this is a question that has been asked time and time again; however, it never seems to get a definitive answer, or at least not one I understand.

Having been "raised" with blind hypothesis testing, I struggle with the notion of model comparison, in particular with AIC:

Say I have a full model, for example the effects of variables A, B and C on response y.
I calculate the AIC of each nested model, rank the models from lowest to highest AIC, and use the rule of thumb of delta AIC < 2 to select the models of interest.
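For concreteness, here is a minimal sketch of that workflow in Python, assuming a hypothetical pandas DataFrame `df` with columns y, A, B and C (all names are illustrative) and using statsmodels for the fits:

```python
import itertools

import pandas as pd
import statsmodels.formula.api as smf

predictors = ["A", "B", "C"]
candidates = []
for k in range(len(predictors) + 1):
    for subset in itertools.combinations(predictors, k):
        # "y ~ 1" is the intercept-only (null) model
        formula = "y ~ " + (" + ".join(subset) if subset else "1")
        fit = smf.ols(formula, data=df).fit()
        candidates.append({"model": formula, "AIC": fit.aic})

table = pd.DataFrame(candidates).sort_values("AIC").reset_index(drop=True)
table["delta_AIC"] = table["AIC"] - table["AIC"].min()
print(table)  # models with delta_AIC < 2 are the usual "models of interest"
```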

What do I conclude if the models of interest do not include the null model, yet when I examine these models with hypothesis testing, none of the variables is remotely significant (alpha = 5%)?

In my understanding, the best models (delta AIC < 2 in this case) are more parsimonious and fit better than the null model, which should mean that the variable(s) they contain explain the data significantly. Otherwise, wouldn't the null model be the best in terms of AIC, since it is the most parsimonious?

Maybe I understand it all wrong, but the issue is that my teachers are not really trained in statistics and mostly use AICs because they are popular in our field at the moment…

Thank you

Best Answer

For nested models, the AIC bears a very close relationship to likelihood-ratio testing. For nested models that differ by only a single fitted parameter, the relationship is exact. Putting aside the delta-AIC < 2 rule of thumb for now, if you simply choose between two models based on which has the lower AIC, then this is equivalent to basing the choice on a $\chi^2$ test p-value of 0.157. (Note that this is less stringent than a classic p < 0.05 cutoff.) For any given difference in the number of fitted parameters between nested models, you can use the AIC definition to convert the AIC difference into a corresponding log-likelihood difference for a $\chi^2$ test. So with nested models one should expect some relationship between delta-AIC values and the statistics underlying standard significance tests.
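To make that correspondence concrete, here is a small sketch (the function name is illustrative; scipy supplies the $\chi^2$ survival function) that turns an AIC comparison between nested models into the implied likelihood-ratio p-value:

```python
from scipy.stats import chi2

def aic_diff_to_lr_pvalue(aic_small, aic_big, df_diff):
    """p-value of the likelihood-ratio test implied by an AIC comparison.

    aic_small: AIC of the model with fewer parameters (nested in the bigger one)
    aic_big:   AIC of the model with more parameters
    df_diff:   difference in the number of fitted parameters
    """
    # AIC = 2k - 2*logLik, so the LR statistic 2*(logLik_big - logLik_small)
    # equals 2*df_diff - (AIC_big - AIC_small).
    lr_stat = 2 * df_diff - (aic_big - aic_small)
    return chi2.sf(lr_stat, df_diff)

# Preferring the bigger model whenever its AIC is merely lower (AIC difference of 0)
# with a 1-parameter difference corresponds to p ~= 0.157:
print(aic_diff_to_lr_pvalue(100.0, 100.0, 1))
```

With a one-parameter difference and equal AICs, the implied p-value is about 0.157, matching the cutoff mentioned above.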

It's not completely clear from your question, however, whether the models being evaluated are actually nested. With 3 predictors A, B, C you could in principle evaluate the null model, each predictor individually, the combinations A+B, A+C, B+C, and the full A+B+C model. If you are evaluating all those possibilities then the models can't all be nested one inside the next.

The parsimony/AIC issue is more complicated. A model close to the model with the minimum AIC could have either more or fewer fitted parameters than the minimum-AIC model. Under the interpretation of a delta-AIC value as estimating the probability of information loss with a different model, models with the same delta-AIC from the minimum model carry the same strength of evidence. So choosing the most parsimonious model within a delta-AIC of 2 would be imposing an additional restriction that isn't necessarily in keeping with the information-theoretic basis of the AIC. For example, if you were to weight models by their AIC (as in Akaike weighting), there would be no reason to prefer a more-parsimonious over a less-parsimonious model having the same delta-AIC.
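As an illustration, here is a minimal sketch of AIC-based model weighting (Akaike weights) with made-up AIC values: two models sharing the same delta-AIC get identical weights, regardless of how many parameters each has.

```python
import numpy as np

def akaike_weights(aics):
    aics = np.asarray(aics, dtype=float)
    delta = aics - aics.min()           # delta-AIC relative to the best model
    rel_lik = np.exp(-0.5 * delta)      # relative likelihood of each model
    return rel_lik / rel_lik.sum()      # normalized Akaike weights

# Hypothetical AICs: the second and third models may differ in parsimony,
# but they share delta-AIC = 1.5 and therefore receive the same weight.
print(akaike_weights([210.0, 211.5, 211.5, 215.0]))
```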

The Wikipedia page on AIC provides one clue to what might be going on with your observations:

Note that AIC tells nothing about the absolute quality of a model, only the quality relative to other models. Thus, if all the candidate models fit poorly, AIC will not give any warning of that. Hence, after selecting a model via AIC, it is usually good practice to validate the absolute quality of the model. Such validation commonly includes checks of the model's residuals (to determine whether the residuals seem to be random) and tests of the model's predictions.

That said, you need to be careful in interpreting p-values for individual coefficients in a multiple-regression model. In some circumstances (e.g., highly correlated predictors) one might have a model that is significantly different from the null overall but for which no individual regression coefficient passes the standard p < 0.05 cutoff. And as noted in a comment, if any predictor is categorical with more than 2 levels, you have to be very careful about just looking at the p-values of individual coefficients.
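A quick simulation can illustrate the collinearity point; the data-generating settings below are invented purely for illustration and use statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 60
A = rng.normal(size=n)
B = A + rng.normal(scale=0.05, size=n)    # B is nearly identical to A
y = A + B + rng.normal(scale=2.0, size=n)

X = sm.add_constant(np.column_stack([A, B]))
fit = sm.OLS(y, X).fit()
print(fit.f_pvalue)      # overall model vs. null: typically very small
print(fit.pvalues[1:])   # individual coefficients: often both above 0.05
```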

So one should certainly make sure that a model developed this way is better than a null model, whether in terms of standard significance tests or in the ability to make useful predictions on new data. The p values for the individual predictors, however, should be of less concern.
