Solved – Negative binomial GLM: the most complex model (all interaction terms) always has the lowest AIC

aic, bic, generalized-linear-model, model, r

I apologize for all these modelling questions. This is the very first time I have tried a GLM, and I am really lost even after reading a lot of papers.

I have grouped my covariates by theme (e.g., all topographic variables together), and I was trying to first find the most parsimonious model for each set of variables. I am fitting negative binomial GLMs and using the step function for selection. My continuous variables were scaled.

Now, my problem is that whatever model I run, the most complicated one is always the one with the smallest AIC. For example, if I run:

library(MASS)  # glm.nb for negative binomial fits
glm.nb(responsevar ~ Cov1 * Cov2 * Cov3, data = df)

the best model will be:

responsevar ~ Cov1 + Cov2 + Cov3 + Cov1:Cov2 + Cov1:Cov3 + Cov2:Cov3 + Cov1:Cov2:Cov3

My sample size is 2780. Would it be better to use BIC to select the models? Should I limit the interactions to those I think are coherent?

Best Answer

Here are some options and things to consider:

First, AIC is a somewhat naive variable selection criterion: its penalty is a flat 2 per parameter regardless of sample size, so with n = 2780 even a tiny improvement in log-likelihood is enough to keep a term. The fact that your model is not improved (in terms of AIC) by removing any of the variables suggests that maybe you don't want to remove anything.

Second, consider the purpose of this kind of model selection. You only have three inputs, so even with interaction terms you're not meaningfully improving the computational efficiency of your model by dropping terms. Moreover, it looks like you haven't checked whether you're overfitting at all, which I would argue is the main benefit of seeking parsimonious models. You can certainly use BIC to look for a sparser model, but you should then compare that model against the full model using some other criterion, such as cross-validation or a held-out test set; a sketch of both follows below.
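Here is a minimal sketch of both ideas, assuming your data live in a data frame df with columns responsevar, Cov1, Cov2, and Cov3 (the names used in your snippets). Note that stepAIC() (the MASS counterpart of step(), which handles glm.nb fits) performs BIC-based selection when its penalty argument k is set to log(n) instead of the AIC default of 2:

library(MASS)  # glm.nb and stepAIC

# BIC-based backward elimination: each parameter now costs log(2780) ~ 7.93
full.model <- glm.nb(responsevar ~ Cov1 * Cov2 * Cov3, data = df)
bic.model  <- stepAIC(full.model, k = log(nrow(df)))

# compare the full and BIC-selected models on a held-out test set
set.seed(42)
test.rows <- sample(nrow(df), round(0.2 * nrow(df)))
train <- df[-test.rows, ]
test  <- df[test.rows, ]

m.full <- glm.nb(formula(full.model), data = train)
m.bic  <- glm.nb(formula(bic.model),  data = train)

# out-of-sample error on the response scale (smaller is better)
mean((test$responsevar - predict(m.full, test, type = "response"))^2)
mean((test$responsevar - predict(m.bic,  test, type = "response"))^2)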

Third, this is the model recommended by backwards elimination. You can try building your model up via forward selection instead and see whether you get a different (perhaps better) result. Something like this, using glm.nb since you're fitting negative binomial models:

library(MASS)  # glm.nb and stepAIC

null.model <- glm.nb(responsevar ~ 1, data = df)                  # intercept only
full.model <- glm.nb(responsevar ~ Cov1 * Cov2 * Cov3, data = df) # all interactions

forward.model <- stepAIC(null.model,
                         scope = list(lower = null.model, upper = full.model),
                         direction = "forward")

Keep in mind: stepwise regression is a greedy algorithm and is prone to getting stuck in local optima.

Finally, stepwise regression isn't your only option for model selection. You could use PCA to examine how much redundant information your covariates share, or use regularization techniques such as ridge regression (an L2 penalty, which shrinks coefficients) and lasso regression (an L1 penalty, which performs both variable selection and coefficient shrinkage); sketches of both follow.
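For the PCA side, a minimal sketch (again assuming the covariates sit in df): summary() reports the variance explained by each component, and the rotation matrix shows which covariates load on which component.

# PCA on the covariates; scale. = TRUE standardizes them (harmless here,
# since yours are already scaled, and it keeps the sketch self-contained)
pc <- prcomp(df[, c("Cov1", "Cov2", "Cov3")], scale. = TRUE)
summary(pc)     # proportion of variance explained per component
pc$rotation     # loadings: which covariates drive each component

And glmnet covers both penalties: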

library(glmnet)
lasso.model <- glmnet(x, y)             # alpha = 1 (lasso) is the default
ridge.model <- glmnet(x, y, alpha = 0)  # alpha = 0 switches to ridge
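Note that glmnet wants a numeric model matrix x and a response vector y rather than a formula, and it doesn't fit the negative binomial out of the box; recent versions (4.0 and later) do accept a general family object, though. A hedged sketch, with theta as a placeholder you would take from your glm.nb fit:

library(glmnet)
library(MASS)

# model matrix with all interactions; drop the intercept column,
# since glmnet adds its own intercept
x <- model.matrix(~ Cov1 * Cov2 * Cov3, data = df)[, -1]
y <- df$responsevar

# cross-validated lasso and ridge; theta = 1 is only a placeholder
nb.family <- negative.binomial(theta = 1)
lasso.cv  <- cv.glmnet(x, y, alpha = 1, family = nb.family)
ridge.cv  <- cv.glmnet(x, y, alpha = 0, family = nb.family)

coef(lasso.cv, s = "lambda.1se")   # coefficients at a conservative lambda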

These are, of course, just a handful of the options available to you. It's entirely possible that the model you already have is perfectly suitable. Is there any particular reason you suspect it's not?