Solved – Mixed models and backward elimination

lme4-nlme, mixed model, model selection

Let's say I have data like this, and I'm trying to build a mixed model.

studentId   | courseId | courseName | year | courseGroup | timespent | count | mark
stud1       | 19       | M101       | 2008 | F           | 12.3      | 23    | 3.7
stud1       | 21       | E102       | 2008 | C           | 2.3       | 15    | 4
stud1       | 109      | H300       | 2008 | E           | 22.3      | 5     | 3
stud2       | 19       | M101       | 2008 | F           | 3.3       | 45    | 3
stud2       | 21       | E102       | 2008 | C           | 12.3      | 56    | 3.3
stud3       | 200      | M101       | 2009 | F           | 12.3      | 21    | 3.7

The full model would be:

library(lmerTest)  # lmer() from lmerTest adds the Satterthwaite p-values shown below
lmer.model.full <- lmer(mark ~ courseGroup + timespent + count + courseGroup:(timespent + count) + (1 | studentId) + (1 | courseName/courseId), data = dat)  # dat = the data frame sketched above

Fitting the model gives the following summary:

Fixed effects:
                         Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)             3.909e+00  1.282e-01  8.089e+02  30.491  < 2e-16 ***
courseGroupE           -2.404e-01  1.835e-01  6.417e+02  -1.311  0.19049    
courseGroupF           -1.105e-01  1.493e-01  2.246e+02  -0.740  0.46008    
timespent              -1.552e-02  5.065e-02  1.872e+03  -0.306  0.75932    
count                  -5.244e-02  5.409e-02  1.869e+03  -0.969  0.33244    
courseGroupE:timespent  8.740e-02  5.184e-02  1.823e+03   1.686  0.09196 .  
courseGroupF:timespent  2.350e-03  3.992e-02  1.881e+03   0.059  0.95308    
courseGroupE:count     -6.546e-02  2.673e-02  1.158e+03  -2.449  0.01446 *  
courseGroupF:count     -7.015e-02  2.470e-02  1.373e+03  -2.840  0.00457 ** 

To proceed with model selection using backward elimination, I should continue by removing the term with the highest p-value, which (among the main effects) is timespent. But the interaction between timespent and courseGroup E is marginally significant and might become significant in later iterations, while the p-value for the interaction between the F group and timespent is 0.95. What should I do in this case, and how should I proceed?
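
For reference, this kind of backward elimination can be automated with lmerTest's step(), which respects marginality: a main effect is only considered for removal after every interaction containing it has been dropped. A minimal sketch, assuming lmer.model.full is the fitted model above:

library(lmerTest)

# step() drops non-significant terms one at a time; by default it also tries
# to simplify the random effects, so reduce.random = FALSE keeps them fixed
elim <- step(lmer.model.full, reduce.random = FALSE)
elim                      # table of eliminated fixed-effect terms and their p-values
final <- get_model(elim)  # the model that remains after elimination
summary(final)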

One more thing puzzles me. What should I do in a case like the following?

 Fixed effects:
                             Estimate Std. Error         df t value Pr(>|t|)    
    (Intercept)             3.909e+00  1.282e-01  8.089e+02  30.491  < 2e-16 ***
    courseGroupE           -2.404e-01  1.835e-01  6.417e+02  -1.311  0.16049    
    courseGroupF           -1.105e-01  1.493e-01  2.246e+02  -0.740  0.96008    
    timespent              -1.552e-02  5.065e-02  1.872e+03  -0.306  0.25932    
    count                  -5.244e-02  5.409e-02  1.869e+03  -0.969  0.33244    
    courseGroupE:timespent  8.740e-02  5.184e-02  1.823e+03   1.686  0.09196 .  
    courseGroupF:timespent  2.350e-03  3.992e-02  1.881e+03   0.059  0.25308    
    courseGroupE:count     -6.546e-02  2.673e-02  1.158e+03  -2.449  0.01446 *  
    courseGroupF:count     -7.015e-02  2.470e-02  1.373e+03  -2.840  0.00457 ** 

Should I remove courseGroup from the model, leaving only timespent and count, or should I remove count? What would be the best indicator that courseGroup should be excluded from the model?

Best Answer

The $p$-values you are getting are based on Satterthwaite's approximation of the degrees of freedom. Such an approximation is needed because the degrees of freedom (DFs) are not obvious in the context of a mixed-effects model. Other people might advocate a different approximation (e.g. the Kenward-Roger approximation); realistically, given the DFs you are looking at (200+ most of the time), the difference between one approximation and another will be quite insignificant, and with DFs >= 500 your $t$-values are essentially $z$-values anyway. To state the obvious though: a $p$-value is not the probability of making a mistake by rejecting a true null hypothesis, so a small value does not mandate that the associated variable be included.
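
If you want to see how much the choice of approximation matters for your data, lmerTest lets you request either one directly. A short sketch, assuming lmer.model.full is the fitted model from the question (Kenward-Roger additionally requires the pbkrtest package):

library(lmerTest)

summary(lmer.model.full, ddf = "Satterthwaite")   # the default, shown above
summary(lmer.model.full, ddf = "Kenward-Roger")   # alternative approximation; with
                                                  # hundreds of DFs the two agree closely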

Having said the above, please see the thread Algorithms for automatic model selection; you are in exactly the same situation. If you really want to go down the road of a stepwise (linear mixed-effects) regression, please use AIC, which at least penalizes your model's complexity to some extent. Some standard caveats about the use of AIC with mixed-effects models can be found in the FAQ at http://glmm.wikidot.com. You should not use the "vanilla" version of AIC directly; see Greven and Kneib, 2010 regarding this. They present a corrected, conditional AIC (cAIC) and also provide an R package implementing it.
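
If you do go the information-criterion route, refit the candidate models with ML (REML = FALSE) before comparing their fixed effects. A rough sketch under those assumptions, with model names of my own choosing, dat as the data frame from the question, and the conditional AIC taken from the cAIC4 package (one implementation of the Greven and Kneib correction):

library(lmerTest)

m_full <- lmer(mark ~ courseGroup + timespent + count + courseGroup:(timespent + count) +
                 (1 | studentId) + (1 | courseName/courseId),
               data = dat, REML = FALSE)   # ML fit so the AICs are comparable
m_red  <- lmer(mark ~ courseGroup + timespent + count + courseGroup:count +
                 (1 | studentId) + (1 | courseName/courseId),
               data = dat, REML = FALSE)   # same model without courseGroup:timespent

AIC(m_full, m_red)    # marginal AIC, with the caveats noted above
anova(m_full, m_red)  # likelihood-ratio test of the dropped interaction

library(cAIC4)        # conditional AIC of Greven & Kneib (2010)
cAIC(m_full)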

In general, I would suggest you look into cross-validation techniques and/or bootstrapping your model. Check the functionality around lme4's bootMer and inspect the confidence intervals of your parameters; that should give you a better idea of what is worth including in your final model. I would argue that even jack-knifing is better than $p$-value-based selection.
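
For example, parametric-bootstrap confidence intervals for the fixed effects can be obtained through confint(), which calls bootMer under the hood. A sketch, where lmer.model.full is the fitted model from the question and nsim is kept small purely for illustration:

library(lme4)

set.seed(1)
ci_boot <- confint(lmer.model.full, method = "boot", nsim = 500,
                   parm = "beta_")      # "beta_" = fixed effects only
ci_boot                                 # intervals straddling 0 are weak candidates to keep

# Or call bootMer directly and summarise the bootstrap draws yourself
bt <- bootMer(lmer.model.full, FUN = fixef, nsim = 500)
t(apply(bt$t, 2, quantile, probs = c(0.025, 0.975)))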

Finally, while I "get" the idea of variable selection in terms of fixed effects, in terms of random effects the whole idea strikes me as a simple data-dredging technique. The random-effects structure is supposed to be based on the research question; otherwise one simply cherry-picks an error structure in an attempt to "squeeze more significance out of the remaining terms" (glmm.wikidot's FAQ once more).
