Solved – Is it appropriate to account for overdispersion in a glm by using a quasi-binomial distribution

lme4-nlmeoverdispersionquasi-binomialr

I have several sets of count data (as below) that are overdispersed. The overdispersion likely comes as a result of the number of zeros in the data which I understand means the paramater estimates in my model are okay, but the standard error may be lower than expected. The overdispersion is not due to outliers, the predictors are correctly expressed and I don't believe sparseness is a problem.

Temperature Alive   Dead
28  10  0
28  10  0
28  9   1
28  10  0
30  10  0
30  10  0
30  10  0
30  10  0
32  9   1
32  9   1
32  9   1
32  10  0
34  0   10
34  0   10
34  0   10
34  2   8
36  0   10
36  0   10
36  0   10
36  0   10

I've looked into using a negative binomial distribution which better describes overdispersion than a model with a Poisson distribution. I don't think this applies to my data where there are a fixed number of treatments.

Adding an observation-level random effect does not better describe zero-inflated data (Harrison 2014).

What I'm left with is potentially using a quasi-binomial distribution which has a PMF that includes a dispersion parameter to model a generalized linear mixed effects model (the data are also blocked by trial). This method is "less sensitive to the distribution of the response" (Faraway 2016).

Is this an appropriate method to account for the dispersion in my data? If so, could you explain how it accounts for the dispersion or is it simply the probability mass function that does this?

**Edit

To show how I came to the conclusion overdispersion is a problem, I calculated a dispersion parameter and checked it with bootstrapping as follows:

n<-1000
Disp=numeric(n)
for (i in 1:n){
  data_sim <- simulate(model)
  data_sim<-data_sim[,1]
  data_sim<-as.data.frame(data_sim)
  data_sim$Temperature<-ostad$Temperature
  model_sim <- glm(cbind(Alive, Dead)~Temperature, data=data_sim, family=binomial(link="logit"))
  Disp[i] <- sum(resid(model_sim, type="pearson")^2)/model_sim$df.res
}
sigma2<-sum(resid(model, type="pearson")^2)/model$df.re

The dispersion parameter was outside of the 95% confidence interval predicted from bootstrapping therefore indicating the dispersion for the real data was larger than expected from the model.

Best Answer

Firstly, I don't think you have zero-inflation in your data (or at least the data that you have included in the question). Zero-inflation arises when something (unobserved) results in a zero count/observation even though the other predictors suggest that the observation should be positive. In your case, a (silly) example might be a disgruntled grad student sneaking into the lab and spraying DDT on some of the warm temperature experiments to mess with your head - even though the subjects should have survived at e.g. 28 degrees, some unseen force has prevented this from happening. A less silly example is a recorded zero abundance in habitat data simply because a perfectly suitable area has never been colonised (either by chance or some physical barrier), or a zero recording of parasite counts from a highly susceptible animal that has never been exposed to the parasites. I think people are generally too quick to jump to zero-inflated models simply because of a large number of observed zeros - see also:
Warton, D.I. 2005. Many zeros does not mean zero inflation: comparing the goodness-of-fit of parametric models to multivariate abundance data. Environmetrics 16:275–289.

So if you want to model over-dispersion then I would suggest using an observation level random effect (where 'observation' is the number dead and alive from each group e.g. {10,0} for the first row). I have used this approach successfully for similar analyses, although generally for larger group sizes than 10.

However based on the data you have shown I don't think this is necessary either: all of the observations below 32 degrees are entirely consistent with a common probability of survival (around 97%), and all of the observations above 34 degrees are also entirely consistent with a common probability of survival (around 3%). If you fit an over dispersed model to this then the optimiser will probably reduce the over dispersion component to zero. If this really is your data then what you actually need to fit is a temperature threshold effect (e.g. above/below 33 degrees), which will then describe the data so well that it will in fact be quasi separated ... leading you to potentially more problems! Of course it is also possible that the data you have shown is incomplete and/or a fabricated example, in which case you can ignore this paragraph :)

---- EDIT IN RESPONSE TO EDITED QUESTION ----

The model that you are tying to fit uses a linear effect of temperature, but your data suggests that the effect is not linear (on the logit scale). If you have only a linear effect of temperature then an additional parameter (over dispersion) is needed to suck up the extra unexplained variation in the response, but you may be able to do a better job with a more appropriate effect of temperature. Try the following code for inspiration.

Your data, and a new data frame to use only for visualising predictions:

df <- read.table(header=TRUE, file=textConnection('Temperature Alive   Dead
28  10  0
28  10  0
28  9   1
28  10  0
30  10  0
30  10  0
30  10  0
30  10  0
32  9   1
32  9   1
32  9   1
32  10  0
34  0   10
34  0   10
34  0   10
34  2   8
36  0   10
36  0   10
36  0   10
36  0   10'))

df$Response <- cbind(df$Alive, df$Dead)
	df$Proportion <- df$Alive / (df$Alive + df$Dead)
	df$Replicate <- 1:4

newdata <- data.frame(Temperature=seq(28,36,length=1000))

The model you are using assumes a linear (on the logit scale) effect of temperature, but the plot of the datapoints suggests a more drastic change between 32 and 34 degrees than is consistent with a linear change:

model1 <- glm(Response ~ Temperature, family=binomial, data=df)
extractAIC(model1)
plot(df$Temperature, df$Proportion, pch=df$Replicate)
	lines(newdata$Temperature, predict(model1, type='response', newdata=newdata), type='l')

A simple threshold effect of 33 degrees gives a better prediction:

model2 <- glm(Response ~ I(Temperature > 33), family=binomial, data=df)
extractAIC(model2)
plot(df$Temperature, df$Proportion, pch=df$Replicate)
	lines(newdata$Temperature, predict(model2, type='response', newdata=newdata), type='l')

An alternative is to use a polynomial expansion to explain a curve with (almost) arbitrary shape - the highest order we can use with your data is 4 but this seems to give the best fit:

model3 <- glm(Response ~ poly(Temperature, 4), family=binomial, data=df)
extractAIC(model3)
plot(df$Temperature, df$Proportion, pch=df$Replicate)
	lines(newdata$Temperature, predict(model3, type='response', newdata=newdata), type='l')

I haven't checked, but I suspect that your test for over dispersion would indicate no problems with either models 2 or 3. The obvious problem with model 2 is that we have chosen the threshold based on the data, so this doesn't help you find the threshold itself using the model. For that reason I'd probably use something more like model 3.

Related Question