Solved – GLMM for overdispersed data

experiment-designgeneralized linear modellme4-nlmer

I am analyzing data from 3 field experiments (farms=3) for a citrus flower disease: response variable is binomial because the flower can only be diseased or healthy.

I have particular interest in comparing 5 fungicide spraying systems (trt=5).
I am not interested in the effect of a specific farm, they simply represent the total of farms from the region where I want to suggest the best treatments.

Each farm had 4 blocks (bk=4) including 2 trees as subsamples (tree=2) in which I assessed 100 flowers each one.

This is a quick look of the data:

dinc <- within(dinc, { tree_id <- as.factor(interaction(farm, trt, bk, tree)) })

farm      trt      bk   tree   dis   tot  tree_id
<fctr>   <fctr>  <fctr> <fctr><int> <int> <fctr>
iaras   Calendar   1      1     0   100  iaras.Calendar.1.1
iaras   Calendar   1      2     1   100  iaras.Calendar.1.2
iaras   Calendar   2      1     1   100  iaras.Calendar.2.1
iaras   Calendar   2      2     3   100  iaras.Calendar.2.2

The model I considered was:

resp <- with(df, cbind(dis, tot-dis)) 

m1 = glmer(resp ~ trt + (1|farm/bk) , family = binomial, data=df) 

I tested the overdispersion with the overdisp_fun() from GLMM page

        chisq         ratio             p          logp 
 4.191645e+02  3.742540e+00  4.804126e-37 -8.362617e+01 

As ratio (residual dev/residual df) > 1, and the p-value < 0.05, I considered to add the observation level random effect (link) to deal with the overdispersion.

so now was added a random effect for each row (tree_id) to the model, but I am not sure of how to include it. This is my approach:

m2 = glmer(resp ~ trt + (1|farm/bk) + (1|tree_id), family = binomial, data=df)

I also wonder if farm should be a fixed effect, since it has only 3 levels…

m3 = glmer(resp ~ trt * farm + (1|farm:bk) + (1|tree_id), family = binomial, data=df)

I really appreciate your suggestions about my model specifications…

Best Answer

Overdispersion Problem

It looks like you're modeling a count variable as a binomial and I think that's the source of your overdispersion.

You could model everything as a binomial distribution, but the total for each observation is exactly the same. Plus, the count of diseased plants never reaches the maximum of 100, so it's not really censored the way a binomial would be.

EDIT: So, you could easily report this as a "rate" of disease over the total sample. In this way you could analyze the 'count' of disease or proportion (disease / total) as a negative binomial model.

EDIT2: Because there seems to be some hesitance to use a negative binomial, here is a list of recent phytopathology articles (same discipline as OP) that model disease as a negative binomial (Prager et al., 2014, Mori et al., 2008, Passey et al., 2017, Paiva de Almeida et al., 2016)

A histogram of your y variable looks like a zero inflated negative binomial.

enter image description here

Note the long right tail that you typically see with a negative binomial or Poisson.

There are a few different ways to handle this, but here's an easy solution:

m4<-glmer.nb(dis ~ trt + (1 | farm/bk),data = dinc)

summary(m4)
overdisp_fun(m4)

I got the following overdispersion results:

      chisq       ratio         rdf           p 
122.1655582   1.0811111 113.0000000   0.2617332 

Looks good, right?

(EDIT: Ignore strikethrough portion below)

### Side Issue: Your Trees are Independent Observations

At first, it looks like each of the two trees should be a random effect.

However, Tree 1 on farm 1 is not comparable to Tree 1 in farm 2. Therefore, you don't want to model the effect of Tree as a random effect. Imagine if each Tree was a different person. Adding a random effect for each person wouldn't matter unless you had multiple observations per person.

Similarly, including the farm "block" doesn't really have an effect on the model.


Alternative Models and Final Thoughts

  • Could potentially check out zero inflated negative binomial
  • Although your dispersion doesn't seem bad with standard nb
  • The MASS package is an alternative way to run a nb model
  • Additionally you could run this as a Quasi-Poisson
  • I'll include some code below, in case you want to pursue this

 require("MASS")

 m5<-glmmPQL(dis ~ trt ,
             random = ~ 1 | farm/bk,
             family = negative.binomial(theta=9.86), 
             data = dinc)

 summary(m5)

 m6<-glmmPQL(dis ~ trt ,
             random = ~ 1 | farm/bk,
             family = quasipoisson(link='log'), 
             data = dinc)

 summary(m6)

Best of luck with your model!


EDIT In case you'd like to run this as a "rate", please try this code:

dinc$dis_prob<- dinc$dis / dinc$tot 

m7<-glmmPQL(dis_prob ~ trt ,
             random = ~ 1 | farm/bk,
             family = quasipoisson(link='log'), 
             data = dinc)

summary(m7)
Related Question