I am analyzing data from 3 field experiments (farms=3) for a citrus flower disease: response variable is binomial because the flower can only be diseased or healthy.
I have particular interest in comparing 5 fungicide spraying systems (trt=5).
I am not interested in the effect of a specific farm, they simply represent the total of farms from the region where I want to suggest the best treatments.
Each farm had 4 blocks (bk=4) including 2 trees as subsamples (tree=2) in which I assessed 100 flowers each one.
This is a quick look of the data:
dinc <- within(dinc, { tree_id <- as.factor(interaction(farm, trt, bk, tree)) })
farm trt bk tree dis tot tree_id
<fctr> <fctr> <fctr> <fctr><int> <int> <fctr>
iaras Calendar 1 1 0 100 iaras.Calendar.1.1
iaras Calendar 1 2 1 100 iaras.Calendar.1.2
iaras Calendar 2 1 1 100 iaras.Calendar.2.1
iaras Calendar 2 2 3 100 iaras.Calendar.2.2
The model I considered was:
resp <- with(df, cbind(dis, tot-dis))
m1 = glmer(resp ~ trt + (1|farm/bk) , family = binomial, data=df)
I tested the overdispersion with the overdisp_fun() from GLMM page
chisq ratio p logp
4.191645e+02 3.742540e+00 4.804126e-37 -8.362617e+01
As ratio (residual dev/residual df) > 1, and the p-value < 0.05, I considered to add the observation level random effect (link) to deal with the overdispersion.
so now was added a random effect for each row (tree_id
) to the model, but I am not sure of how to include it. This is my approach:
m2 = glmer(resp ~ trt + (1|farm/bk) + (1|tree_id), family = binomial, data=df)
I also wonder if farm
should be a fixed effect, since it has only 3 levels…
m3 = glmer(resp ~ trt * farm + (1|farm:bk) + (1|tree_id), family = binomial, data=df)
I really appreciate your suggestions about my model specifications…
Best Answer
Overdispersion Problem
It looks like you're modeling a count variable as a binomial and I think that's the source of your overdispersion.
You could model everything as a binomial distribution, but the total for each observation is exactly the same.
Plus, the count of diseased plants never reaches the maximum of 100, so it's not really censored the way a binomial would be.EDIT: So, you could easily report this as a "rate" of disease over the total sample. In this way you could analyze the 'count' of disease or proportion (disease / total) as a negative binomial model.
EDIT2: Because there seems to be some hesitance to use a negative binomial, here is a list of recent phytopathology articles (same discipline as OP) that model disease as a negative binomial (Prager et al., 2014, Mori et al., 2008, Passey et al., 2017, Paiva de Almeida et al., 2016)
A histogram of your y variable looks like a zero inflated negative binomial.
Note the long right tail that you typically see with a negative binomial or Poisson.
There are a few different ways to handle this, but here's an easy solution:
I got the following overdispersion results:
Looks good, right?
(EDIT: Ignore strikethrough portion below)
### Side Issue: Your Trees are Independent ObservationsAt first, it looks like each of the two trees should be a random effect.
However, Tree 1 on farm 1 is not comparable to Tree 1 in farm 2. Therefore, you don't want to model the effect of Tree as a random effect. Imagine if each Tree was a different person. Adding a random effect for each person wouldn't matter unless you had multiple observations per person.
Similarly, including the farm "block" doesn't really have an effect on the model.Alternative Models and Final Thoughts
Best of luck with your model!
EDIT In case you'd like to run this as a "rate", please try this code: