Solved – Incorporating random effects in the logistic regression formula in R

logistic, r, stepwise regression

I'm trying to find the best model by AIC using stepwise selection (direction = "both") with stepAIC() from the MASS package in R.

This is the script I used:

stepAIC (glmer(decision ~ as.factor(Age) + as.factor(Educ) + as.factor(Child), family=binomial, data=RShifting), direction="both")

However, I got this error:

Error in lmerFactorList(formula, fr, 0L, 0L) : 
  No random effects terms specified in formula

I then added (1|town) to the formula, since town is the random effect (respondents are nested within towns), and ran this script:

stepAIC (glmer(decision ~ as.factor(Age) + as.factor(Educ) + as.factor(Child) + (1|town), family=binomial, data=RShifting), direction="both")

The result is this:

Error in x$terms : $ operator not defined for this S4 class

I hope you could help me figure out how to solve this problem. Thanks a lot.

Best Answer

Short answer is you can't - well, not without recoding a version of stepAIC() that knows how to handle S4 objects. stepAIC() knows nothing about lmer() and glmer() models, and there is no equivalent code in lme4 that will allow you to do this sort of stepping.
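To see why the second error appears, here is a minimal sketch. It assumes lme4 is installed and uses its bundled cbpp data as a stand-in for RShifting, which is not available here. The point is that the fitted object is an S4 object, so list-style access such as x$terms fails, while the terms() extractor still works.

```r
library(lme4)

# Stand-in model: cbpp ships with lme4 and is NOT the asker's data
fit <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
             data = cbpp, family = binomial)

# glmer() returns an S4 object, not an S3 list like glm()
is_s4 <- isS4(fit)

# `$` is what stepAIC() trips over; capture the failure instead of stopping
dollar_fails <- inherits(try(fit$terms, silent = TRUE), "try-error")

# The generic extractor is the supported way to get at the model terms
terms_ok <- inherits(terms(fit), "terms")

print(c(is_s4 = is_s4, dollar_fails = dollar_fails, terms_ok = terms_ok))
```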

I also think your whole process needs careful rethinking - why should there be one best model? AIC could instead be used to identify several candidate models that do similar jobs, and those models could then be averaged, rather than trying to find the single best model for your sample of data.
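As a rough illustration of comparing a small candidate set rather than stepping, the sketch below fits two mixed logistic models and converts their AIC values into Akaike weights. It assumes lme4 is installed, uses the bundled cbpp data as a stand-in for RShifting, and the two candidate formulas are hypothetical choices made only for the example.

```r
library(lme4)

# Hypothetical candidate set; all share the same random-effects structure
candidates <- list(
  with_period = glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
                      data = cbpp, family = binomial),
  intercept   = glmer(cbind(incidence, size - incidence) ~ 1 + (1 | herd),
                      data = cbpp, family = binomial)
)

# AIC() has a method path for merMod fits via logLik()
aic   <- sapply(candidates, AIC)
delta <- aic - min(aic)

# Akaike weights: relative support for each model within this set
w <- exp(-delta / 2) / sum(exp(-delta / 2))

print(data.frame(AIC = aic, delta = delta, weight = w))
```

Models with similar weights are candidates for averaging. The weights only rank models within the set that was supplied, so the set should be chosen on subject-matter grounds before looking at the data.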

Selection via AIC is effectively doing multiple testing - but how should you correct the AIC to take into account the fact that you are doing all this testing? How do you interpret the precision of the coefficients for the final model you might select?

A final point: don't do all the as.factor() calls in the model formula, as it just makes the whole thing a mess, takes up a lot of space, and doesn't aid understanding of the model you fitted. Get the data in the correct format first, then fit the model, e.g.:

RShifting <- transform(RShifting,
                       Age = as.factor(Age),
                       Educ = as.factor(Educ),
                       Child = as.factor(Child))

then

glmer(decision ~ Age + Educ + Child + (1|town), family=binomial, 
      data=RShifting)

Apart from making things far more readable, it separates the tasks of data processing from the data analysis steps.

Related Solutions

Solved – Stepwise model selection, Hosmer-Lemeshow statistics and prediction success of model in nested logistic regression in R

I really appreciate the pointers to my book and papers and R package. Briefly, stepwise regression is invalid as it destroys all statistical properties of the result as well as faring poorly in predictive accuracy. There is no reason to use ROC curves to guide model selection (if model selection is even a good idea), because we have the optimum measure, the log-likelihood and its variants such as AIC. Thresholds for the dependent variable should be dealt with using ordinal regression instead of making a series of binary models. The Hosmer-Lemeshow test is now considered obsolete by many statisticians as well as the original authors. See the reference below (which proposes a better method, implemented in the rms package).

@ARTICLE{hos97com,
  author  = {Hosmer, D. W. and Hosmer, T. and {le Cessie}, S. and Lemeshow, S.},
  year    = 1997,
  title   = {A comparison of goodness-of-fit tests for the logistic regression model},
  journal = {Statistics in Medicine},
  volume  = 16,
  pages   = {965-980},
  annote  = {goodness-of-fit for binary logistic model; difficulty with Hosmer-Lemeshow statistic being dependent on how groups are defined; sum of squares test (see cop89unw); cumulative sum test; invalidity of naive test based on deviance; goodness-of-link function; simulation setup; see sta09sim}
}

Solved – Coding of categorical random effects in R: int vs factor

The difference is because you're separating the intercept and the slope in the random effect. That's an odd thing to do; the usual way to fit this model would be

OK ~ multi + (multi | item) + (1 | subject)

with multi being a factor.

What happens is that in the first model you get what you expect; the 0+multi|item term gives one parameter and the 1|item term gives one parameter, but in the second model the 0 + multi | item term results in two parameters, which are simply the estimate for each condition. If you take the 1|item term out of that model you should get a result that is equivalent to both your first model and the one I give above, except for differences in parameterization.

Note also the correlation of exactly one in your second model; this is a clue that you've overparameterized it and that one of those parameters is not necessary.
