Solved – Repeated measures – random effects for logistic regression in R

logisticmixed modelrrandom-effects-model

Study design

504 individuals were all sampled 2 times. Once before and once after a celebration.

The goal is to investigate if this event (Celebration) as well as working with animals (sheepdog) have an influence on the probability that an individual gets infected by a parasite. (out of 1008 observations only 22 are found to be infected)

Variables

  • dependent variable = "T_hydat" (infected or not)

(most predictiv variables are categorical)

  • "Celebration" (yes/no)
  • "sex" (m/f)
  • "RelAge" (5 levels)
  • "SheepDog" (yes/no)
  • "Area" (geographical area = 4 levels)
  • "InfectionPeriodT_hydat" (continuous –> Nr Days after deworming")
  • "Urbanisation (3 levels)

Question 1:

1) Should I include Individual-ID ("ID") as a random Effekt as I sampled each Ind. 2 times? (Pseudoreplication?)

mod_fail <- glmer( T_hydat ~ Celebration + Sex + RelAge + SheepDog + InfectionPeriodT_hydat + Urbanisation + (1|ID), family = binomial)

Warnmeldungen:
1: In (function (fn, par, lower = rep.int(-Inf, n), upper = rep.int(Inf,  :
  failure to converge in 10000 evaluations
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 1.10808 (tol = 0.001, component 10)
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model is nearly unidentifiable: large eigenvalue ratio
 - Rescale variables?

–> unfortunately this model fails to converge (is it a problem that ID = 504 levels with only 2 observations per level?)
Convergence is achieved with glmmPQL() but after droping some unsignficant preditiv variables the model fails to converge again ? What is the Problem here? Could geeglm() be a solution?

In an other attempt I run the model only with "Area" (4 levels) as random effect (my expectation is that Ind. in the same geogr. Area are suffering from the same parasite pressure etc.) and received the follwoing p-Values.

My model in R:

mod_converges <- glmer( T_hydat ~ Celebration + Sex + RelAge + SheepDog + InfectionPeriodT_hydat + Urbanisation + (1|Area), family = binomial)

mod_converges output:

summary(mod_converges)

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
 Family: binomial  ( logit )
Formula: T_hydat ~ Celebration + sex + SheepDog + RelAge + Urbanisation +  
    InfectionPeriodT_hydat + (1 | Area)
   Data: dat

     AIC      BIC   logLik deviance df.resid 
   203.0    262.0    -89.5    179.0      996 

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-0.461 -0.146 -0.088 -0.060 31.174 

Random effects:
 Groups Name        Variance Std.Dev.
 Area   (Intercept) 0.314    0.561   
Number of obs: 1008, groups:  Area, 4

Fixed effects:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -6.81086    1.96027   -3.47  0.00051 ***
Celebration1            1.36304    0.57049    2.39  0.01688 *  
sexm                   -0.18064    0.49073   -0.37  0.71279    
SheepDog1               2.02983    0.51232    3.96  7.4e-05 ***
RelAge2                 0.34815    1.18557    0.29  0.76902    
RelAge3                 0.86344    1.05729    0.82  0.41412    
RelAge4                -0.54501    1.43815   -0.38  0.70471    
RelAge5                 0.85741    1.25895    0.68  0.49584    
UrbanisationU           0.17939    0.78669    0.23  0.81962    
UrbanisationV           0.01237    0.59374    0.02  0.98338    
InfectionPeriodT_hydat  0.00324    0.01159    0.28  0.77985    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

glmmPQL-Model output below (added after the first reply by usεr11852)

This model converges with "Sample_ID" as a random effect, however, as statet by
usεr11852 the varaince of the random effect is quiet high 4.095497^2 = 16.8. And the std.error of Area5 is far to high (complete separation).
Can I just remove Datapoints from Area5 to overcome this Problem?

#     T_hydat
#  Area   0   1
#     1 226   4
#     2 203   3
#     4 389  15
#     5 168   0  ## here is the problematic cell

Linear mixed-effects model fit by maximum likelihood
 Data: dat 
  AIC BIC logLik
   NA  NA     NA

Random effects:
 Formula: ~1 | Sample_ID
        (Intercept)  Residual
StdDev:    4.095497 0.1588054

Variance function:
 Structure: fixed weights
 Formula: ~invwt 
Fixed effects: T_hydat ~ Celebration + sex + SheepDog + YoungOld + Urbanisation +      InfectionPeriodT_hydat + Area 
                            Value Std.Error  DF    t-value p-value
(Intercept)            -20.271630     1.888 502 -10.735869  0.0000
Celebration1             5.245428     0.285 502  18.381586  0.0000
sexm                    -0.102451     0.877 495  -0.116865  0.9070
SheepDog1                3.356856     0.879 495   3.817931  0.0002
YoungOldyoung            0.694322     1.050 495   0.661017  0.5089
UrbanisationU            0.660842     1.374 495   0.480990  0.6307
UrbanisationV            0.494653     1.050 495   0.470915  0.6379
InfectionPeriodT_hydat   0.059830     0.007 502   8.587736  0.0000
Area2                   -1.187005     1.273 495  -0.932576  0.3515
Area4                   -0.700612     0.973 495  -0.720133  0.4718
Area5                  -23.436977 28791.059 495  -0.000814  0.9994
 Correlation: 
                       (Intr) Clbrt1 sexm   ShpDg1 YngOld UrbnsU UrbnsV InfPT_ Area2  Area4 
Celebration1           -0.467                                                               
sexm                   -0.355  0.018                                                        
SheepDog1              -0.427  0.079  0.066                                                 
YoungOldyoung          -0.483  0.017  0.134  0.045                                          
UrbanisationU          -0.273  0.005 -0.058  0.317 -0.035                                   
UrbanisationV          -0.393  0.001 -0.138  0.417 -0.087  0.586                            
InfectionPeriodT_hydat -0.517  0.804  0.022  0.088  0.016  0.007  0.003                     
Area2                  -0.044 -0.035 -0.044 -0.268 -0.070 -0.315 -0.232 -0.042              
Area4                  -0.213 -0.116 -0.049 -0.186 -0.023 -0.119  0.031 -0.148  0.561       
Area5                   0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000

Standardized Within-Group Residuals:
          Min            Q1           Med            Q3           Max 
-14.208465914  -0.093224405  -0.022551663  -0.004948562  14.733133744 

Number of Observations: 1008
Number of Groups: 504 

Output from logistf (Firth's penalized-likelihood logistic regression)

logistf(formula = T_hydat ~ Celebration + sex + SheepDog + YoungOld + 
    Urbanisation + InfectionPeriodT_hydat + Area, data = dat, 
    family = binomial)

Model fitted by Penalized ML
Confidence intervals and p-values by Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood 

                               coef   se(coef)  lower 0.95  upper 0.95       Chisq
(Intercept)            -5.252164846 1.52982941 -8.75175093 -2.24379091 12.84909207
Celebration1            1.136833737 0.49697927  0.14999782  2.27716500  5.17197661
sexm                   -0.200450540 0.44458464 -1.09803320  0.77892986  0.17662930
SheepDog1               2.059166246 0.47197694  1.10933774  3.12225212 18.92002321
YoungOldyoung           0.412641416 0.56705186 -0.66182554  1.77541644  0.50507269
UrbanisationU           0.565030324 0.70697218 -0.98974390  1.97489240  0.56236485
UrbanisationV           0.265401035 0.50810444 -0.75429596  1.33772658  0.25619218
InfectionPeriodT_hydat -0.003590666 0.01071497 -0.02530179  0.02075254  0.09198425
Area2                  -0.634761551 0.74958750 -2.27274031  0.90086554  0.66405078
Area4                   0.359032194 0.57158464 -0.76903324  1.63297249  0.37094569
Area5                  -2.456953373 1.44578029 -7.36654837 -0.13140806  4.37267766
                                  p
(Intercept)            3.376430e-04
Celebration1           2.295408e-02
sexm                   6.742861e-01
SheepDog1              1.363144e-05
YoungOldyoung          4.772797e-01
UrbanisationU          4.533090e-01
UrbanisationV          6.127483e-01
InfectionPeriodT_hydat 7.616696e-01
Area2                  4.151335e-01
Area4                  5.424892e-01
Area5                  3.651956e-02

Likelihood ratio test=36.56853 on 10 df, p=6.718946e-05, n=1008
Wald test = 32.34071 on 10 df, p = 0.0003512978

** glmer Model (Edited 28th Jan 2016)**

Output from glmer2var: Mixed effect model with the 2 most "important" variables ("Celebration" = the factor I am interested in and "SheepDog" which was found to have a significant influence on infection when data before and after the celebration were analysed separately.) The few number of positives make it impossible to fit a model with more than two explanatory variables (see commet EdM).

There seems to be a strong effect of "Celebration" that probably cancels out the effect of "SheepDog" found in previous analysis.

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
 Family: binomial  ( logit )
Formula: T_hydat ~ Celebration + SheepDog + (1 | Sample_ID)
   Data: dat

     AIC      BIC   logLik deviance df.resid 
   113.0    132.6    -52.5    105.0     1004 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.5709 -0.0022 -0.0001  0.0000 10.3491 

Random effects:
 Groups    Name        Variance Std.Dev.
 Sample_ID (Intercept) 377.1    19.42   
Number of obs: 1008, groups:  Sample_ID, 504

Fixed effects:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -19.896      4.525  -4.397  1.1e-05 ***
Celebration1    7.626      2.932   2.601  0.00929 ** 
SheepDog1       1.885      2.099   0.898  0.36919    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) Clbrt1
Celebratin1 -0.908       
SheepDog1   -0.297 -0.023

Question 2:

2) Can I use drop1() to get the final model and use the p-Values from summary(mod_converges) for interpretation? Does my output tell me if it makes sense to include the random effect ("Area") ?

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
 Family: binomial  ( logit )
Formula: T_hydat ~ Celebration + SheepDog + (1 | Area)
   Data: dat

     AIC      BIC   logLik deviance df.resid 
   190.8    210.4    -91.4    182.8     1004 

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-0.369 -0.135 -0.096 -0.071 17.438 

Random effects:
 Groups Name        Variance Std.Dev.
 Area   (Intercept) 0.359    0.599   
Number of obs: 1008, groups:  Area, 4

Fixed effects:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -5.912      0.698   -8.47  < 2e-16 ***
Celebration1    1.287      0.512    2.51    0.012 *  
SheepDog1       2.014      0.484    4.16  3.2e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) Clbrt1
Celebratin1 -0.580       
SheepDog1   -0.504  0.027

I know there are quite a few questions but I would really appreciate some advice from experienced people. Thanks!

Best Answer

I think that your original model with 504 levels with each level having two readings is problematic because it potentially suffers from complete separation, especially given the small number of positives in your sample. By complete separation I mean that for a given combination of covariates all responses are the same (usually 0 or 1). You might want to try a different optimizer (ie. something along the lines glmerControl(optimizer='bobyqa','Nelder_Mead', etc., ...) but I would not be very confident that this would work either. In general having some levels with one or two observations is not a problem but when all of them are so low things become computationally odd because you starts having identifiability issues (eg. you definitely cannot evaluate any slopes as a random slope plus a random intercept for every individual would give you one random effect for every observation). You really lose a lot of degrees of freedom any way you count them. You do not show the glmmPQL output but I suspect a very high variance of the random effect that would strongly suggest that there is complete separation. (EDIT: You now show that output and can you can clearly see that the ratio is indeed very high.) You might want to consider using the function logistf from the package with the same name. logistf will fit a penalized logistic regression model that will probably alleviate the issue you experience; it will not use any random effects.

The rule of thumb for the lowest number of levels a random effect can be estimated reasonably is "5 or 6"; below that your estimate for the standard deviation of that effect would really suffer. With this in mind, no; you using Area having just four (4) levels is too aggressive. Probably it makes more sense to use it a fixed effect. In general if I do not get at least 10-11 random effects I am a bit worried about the validity of the random effects assumption; we are estimating a Gaussian after-all.

Yes, you could use drop1 but really be careful not to start data-dredging (which is a bad thing). Take any variable selection procedure with a grain of salt. This is issue is extesnsively discussed in Cross-Validated; eg. see the following two great threads for starters here and here. Maybe it is more reasonable to include certain "insignificant" variables in a model so one can control for them and then comment on why they came out insignificant rather then just break down a model to the absolutely bare-bone where everything is uber-signficant. In any case I would strongly suggest using bootstrap to get confidence intervals for estimated parameters.

Related Question