Generalized linear mixed models: model selection

Tags: aic, glmm, mixed model, model selection, stepwise regression

This question/topic came up in a discussion with a colleague and I was looking for some opinions on this:

I am modeling some data using a random effects logistic regression, more precisely a random intercept logistic regression. For the fixed effects I have 9 variables that are of interest and come into consideration. I would like to do some sort of model selection to find the variables that are significant and give the “best” model (main effects only).

My first idea was to use the AIC to compare different models, but with 9 variables I was not too excited about comparing 2^9 = 512 different models (keyword: data dredging).

I discussed this with a colleague and he told me that he remembered reading about using stepwise (or forward) model selection with GLMMs. But instead of using a p-value (e.g. based on a likelihood ratio test for GLMMs), one should use the AIC as entry/exit criterion.

I found this idea very interesting, but I did not find any references that further discussed this and my colleague did not remember where he read it. Many books suggest using the AIC to compare models but I did not find any discussion about using this together with a stepwise or forward model selection procedure.

So I have basically two questions:

  1. Is there anything wrong with using the AIC as the entry/exit criterion in a stepwise model selection procedure? If yes, what would be the alternative?

  2. Do you have any references that discuss the above procedure (ideally ones I could also cite in a final report)?

Best,

Emilia

Best Answer

Stepwise selection is wrong in multilevel models for the same reasons it is wrong in "regular" regression: the p-values will be too low, the standard errors too small, the parameter estimates biased away from 0, and so on. Most importantly, it denies you the opportunity to think.

9 IVs is not so very many. Why did you choose those 9? Surely you had a reason.

One initial thing to do is look at a lot of plots; which precise ones depends a little on whether your data are longitudinal (in which case plots with time on the x-axis are often useful) or clustered. But surely look at relationships between the 9 IVs and your DV (parallel box plots are one simple possibility).

The ideal would be to build a few models based on substantive sense and compare them using AIC, BIC or some other measure. But don't be surprised if no particular model comes forth as clearly best. You don't say what field you work in, but in many (most?) fields, nature is complicated. Several models may fit about equally well and a different model may fit better on a different data set (even if both are random samples from the same population).
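To make the comparison concrete, here is a minimal sketch in Python of ranking a handful of pre-specified models by AIC. Everything in it is an assumption for illustration: the model names, the log-likelihoods and the parameter counts are made up, and in practice you would take the maximized log-likelihood and the number of estimated parameters (fixed effects plus the random-intercept variance) from whatever GLMM software you use. The Akaike weights are one common way to express how close the candidates are, not the only one.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*logL (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood


def compare_by_aic(candidates):
    """Rank candidate models by AIC.

    candidates maps a model name to (maximized log-likelihood, number of
    estimated parameters). Returns rows of (name, AIC, delta AIC, Akaike
    weight), sorted from best (lowest AIC) to worst.
    """
    scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
    best = min(scores.values())
    deltas = {name: score - best for name, score in scores.items()}
    # Relative likelihood of each model given the data, then normalize.
    rel = {name: math.exp(-0.5 * d) for name, d in deltas.items()}
    total = sum(rel.values())
    rows = [(name, scores[name], deltas[name], rel[name] / total)
            for name in scores]
    return sorted(rows, key=lambda row: row[1])


# Hypothetical values, NOT from any real data set. Parameter counts are
# predictors + fixed intercept + random-intercept variance.
candidates = {
    "demographics": (-412.3, 4),
    "demographics + clinical": (-405.1, 7),
    "all nine predictors": (-403.9, 11),
}

table = compare_by_aic(candidates)
for name, score, delta, weight in table:
    print(f"{name:26s} AIC={score:7.1f}  delta={delta:5.1f}  weight={weight:.3f}")
```

The point of the sketch is that the candidate set is small and chosen on substantive grounds before looking at the fits, which is what distinguishes this from running a stepwise search over all 512 subsets.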

As for references: there are lots of good books on nonlinear mixed models. Which one is best for you depends on (a) what field you are in, (b) the nature of your data, and (c) what software you use.

Responding to your comment

  1. If all 9 variables are scientifically important, I would at least consider including them all. If a variable that everyone thinks is important winds up having a small effect, that is interesting.

  2. Certainly plot all your variables over time and in various ways.

  3. For general issues about longitudinal multilevel models I like Hedeker and Gibbons; for nonlinear longitudinal models in SAS I like Molenberghs and Verbeke. The SAS documentation itself (for PROC GLIMMIX) also provides guidance.