Solved – Variable selection in mixed-effects models (stepAIC, then dredge, then model averaging)

Tags: feature-selection, mixed-model, r

I have a simple mixed model of the form:

fit <- nlme::lme(fixed = Outcome ~ Var1+Var2+...+Varn, random = ~ 1 | cluster, data = data)

I have n = 37 environmental variables and six clusters with 30 measurements each, for a total of 180 measurements/rows. The outcome is the relative abundance of a biological species; species are modelled one by one, assuming no interaction between them (a bold assumption, I guess).
I proceed to select the most important variables with the following, easy-to-use approach:
First, I do a pre-selection with MASS::stepAIC() (alternatively, I could just use the p-values in summary(fit)), which leaves me with 15 predictor variables. Then I subject this reduced model to MuMIn::dredge() for an even sparser model, and finally I do model averaging on the top models based on AICc:

fit <- lme(fixed = Outcome ~ Var1+Var2+...+Varn, random = ~ 1 | cluster, data = data, method = "ML")
bestfit <- stepAIC(fit)
dredgedfit <- dredge(bestfit)
summary(model.avg(dredgedfit, subset=...))
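For concreteness, the steps above can be sketched on simulated data (the cluster layout, the three toy predictors, and their effect sizes are all made up for illustration; nlme and MASS ship with R, while the dredge/model.avg steps would additionally need the MuMIn package). One detail worth noting: stepAIC needs the model fitted with method = "ML", because REML likelihoods are not comparable across models with different fixed effects.

```r
library(nlme)   # lme
library(MASS)   # stepAIC

## Toy data: 6 clusters x 30 rows; only Var1 truly affects the outcome
set.seed(1)
d <- data.frame(
  cluster = factor(rep(1:6, each = 30)),
  Var1 = rnorm(180), Var2 = rnorm(180), Var3 = rnorm(180)
)
d$Outcome <- 0.5 * d$Var1 + rep(rnorm(6, sd = 0.5), each = 30) + rnorm(180)

## ML (not the REML default) so AICs are comparable across fixed effects
fit  <- lme(Outcome ~ Var1 + Var2 + Var3, random = ~ 1 | cluster,
            data = d, method = "ML")
best <- stepAIC(fit, trace = FALSE)
formula(best)

## The remaining steps would then be (MuMIn required; dredge also needs
## the global model fitted with na.action = na.fail):
##   dredgedfit <- MuMIn::dredge(best)
##   summary(MuMIn::model.avg(dredgedfit, subset = delta < 4))
```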

Does this approach make sense? (I am not an expert on mixed models in any way.) Will it identify meaningful variables, and will it pass a review process?

Also: there is a slight improvement in fit when I apply a spatial correlation structure (delta AIC = 1). Should this go directly into fit, or can I apply it later, on the averaged model?
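For reference, in nlme a correlation structure is supplied when the model is fitted, not bolted onto an averaged result afterwards. A minimal sketch using corExp is below; the coordinate columns x and y and the single predictor are hypothetical stand-ins for real spatial coordinates and covariates.

```r
library(nlme)

## Toy data with made-up spatial coordinates within each cluster
set.seed(1)
d <- data.frame(
  cluster = factor(rep(1:6, each = 30)),
  Var1 = rnorm(180),
  x = runif(180), y = runif(180)
)
d$Outcome <- 0.5 * d$Var1 + rep(rnorm(6, sd = 0.5), each = 30) + rnorm(180)

fit0 <- lme(Outcome ~ Var1, random = ~ 1 | cluster, data = d, method = "ML")

## Exponential spatial correlation among within-cluster residuals
fit_sp <- update(fit0, correlation = corExp(form = ~ x + y | cluster))

anova(fit0, fit_sp)   # AIC / likelihood-ratio comparison of the two fits
```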

Best Answer

It is not clear what you are trying to do in the end with the final averaged model, so my comments are mostly in terms of prediction and/or interpreting covariates in the model.

The outlined approach sounds like it could lead to considerable over-fitting, hypothesis tests whose level deviates from the nominal one, and poorly calibrated or overconfident predictions. For a start, the uncertainty from the initial stepwise model selection is ignored in the final model averaging, and so, it seems, is the uncertainty about the spatial correlation structure. One approach that avoids this is to do model averaging across all the considered models (i.e. without first doing stepwise selection), including both a model with equal correlation across sites and one with a more complex spatial correlation structure. Ignoring any part of the model uncertainty understates the uncertainty you should have at the end (in predictions, in coefficients, etc.).
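To make "average across everything you considered" concrete, here is a sketch that computes Akaike weights by hand over a small candidate set that includes one model with a spatial correlation structure. The candidate set, coordinates, and predictors are all illustrative; in practice MuMIn::model.avg automates the averaging over combinations of fixed effects.

```r
library(nlme)

set.seed(1)
d <- data.frame(
  cluster = factor(rep(1:6, each = 30)),
  Var1 = rnorm(180), Var2 = rnorm(180),
  x = runif(180), y = runif(180)
)
d$Outcome <- 0.5 * d$Var1 + rep(rnorm(6, sd = 0.5), each = 30) + rnorm(180)

## A deliberately tiny candidate set, including one model with a spatial
## correlation structure -- all fitted by ML so the AICs are comparable
cands <- list(
  m1 = lme(Outcome ~ Var1,        random = ~ 1 | cluster, data = d, method = "ML"),
  m2 = lme(Outcome ~ Var1 + Var2, random = ~ 1 | cluster, data = d, method = "ML"),
  m3 = lme(Outcome ~ Var1,        random = ~ 1 | cluster, data = d, method = "ML",
           correlation = corExp(form = ~ x + y | cluster))
)

aic <- sapply(cands, AIC)
w   <- exp(-(aic - min(aic)) / 2)
w   <- w / sum(w)        # Akaike weights: nonnegative, sum to 1
round(w, 3)
```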

Another thing to consider is that, depending on what you wish to claim at the end, you may still have a multiplicity issue (e.g. if you want to say something like "covariate X is significant, covariate Y is significant, etc."), which you did not say anything about.

Of course, you are working with not much data and a huge space of potential models. So it may be tempting to reduce that space somehow, but one of the best tools for doing so is up-front thinking rather than stepwise (or other) model selection. The documentation of the dredge function even provides the following nice quote (I did not verify the exact quote in the book):

' "Let the computer find out" is a poor strategy and usually reflects the fact that the researcher did not bother to think clearly about the problem of interest and its scientific setting (Burnham and Anderson, 2002).'
