Solved – Is it appropriate to account for overdispersion in a glm by using a quasi-binomial distribution

lme4-nlmeoverdispersionquasi-binomialr

I have several sets of count data (as below) that are overdispersed. The overdispersion likely comes as a result of the number of zeros in the data which I understand means the paramater estimates in my model are okay, but the standard error may be lower than expected. The overdispersion is not due to outliers, the predictors are correctly expressed and I don't believe sparseness is a problem.

Temperature Alive   Dead
28  10  0
28  10  0
28  9   1
28  10  0
30  10  0
30  10  0
30  10  0
30  10  0
32  9   1
32  9   1
32  9   1
32  10  0
34  0   10
34  0   10
34  0   10
34  2   8
36  0   10
36  0   10
36  0   10
36  0   10

I've looked into using a negative binomial distribution which better describes overdispersion than a model with a Poisson distribution. I don't think this applies to my data where there are a fixed number of treatments.

Adding an observation-level random effect does not better describe zero-inflated data (Harrison 2014).

What I'm left with is potentially using a quasi-binomial distribution which has a PMF that includes a dispersion parameter to model a generalized linear mixed effects model (the data are also blocked by trial). This method is "less sensitive to the distribution of the response" (Faraway 2016).

Is this an appropriate method to account for the dispersion in my data? If so, could you explain how it accounts for the dispersion or is it simply the probability mass function that does this?

**Edit

To show how I came to the conclusion overdispersion is a problem, I calculated a dispersion parameter and checked it with bootstrapping as follows:

n<-1000
Disp=numeric(n)
for (i in 1:n){
  data_sim <- simulate(model)
  data_sim<-data_sim[,1]
  data_sim<-as.data.frame(data_sim)
  data_sim$Temperature<-ostad$Temperature
  model_sim <- glm(cbind(Alive, Dead)~Temperature, data=data_sim, family=binomial(link="logit"))
  Disp[i] <- sum(resid(model_sim, type="pearson")^2)/model_sim$df.res
}
sigma2<-sum(resid(model, type="pearson")^2)/model$df.re

The dispersion parameter was outside of the 95% confidence interval predicted from bootstrapping therefore indicating the dispersion for the real data was larger than expected from the model.

Best Answer

Firstly, I don't think you have zero-inflation in your data (or at least the data that you have included in the question). Zero-inflation arises when something (unobserved) results in a zero count/observation even though the other predictors suggest that the observation should be positive. In your case, a (silly) example might be a disgruntled grad student sneaking into the lab and spraying DDT on some of the warm temperature experiments to mess with your head - even though the subjects should have survived at e.g. 28 degrees, some unseen force has prevented this from happening. A less silly example is a recorded zero abundance in habitat data simply because a perfectly suitable area has never been colonised (either by chance or some physical barrier), or a zero recording of parasite counts from a highly susceptible animal that has never been exposed to the parasites. I think people are generally too quick to jump to zero-inflated models simply because of a large number of observed zeros - see also:
Warton, D.I. 2005. Many zeros does not mean zero inflation: comparing the goodness-of-fit of parametric models to multivariate abundance data. Environmetrics 16:275–289.

So if you want to model over-dispersion then I would suggest using an observation level random effect (where 'observation' is the number dead and alive from each group e.g. {10,0} for the first row). I have used this approach successfully for similar analyses, although generally for larger group sizes than 10.

However based on the data you have shown I don't think this is necessary either: all of the observations below 32 degrees are entirely consistent with a common probability of survival (around 97%), and all of the observations above 34 degrees are also entirely consistent with a common probability of survival (around 3%). If you fit an over dispersed model to this then the optimiser will probably reduce the over dispersion component to zero. If this really is your data then what you actually need to fit is a temperature threshold effect (e.g. above/below 33 degrees), which will then describe the data so well that it will in fact be quasi separated ... leading you to potentially more problems! Of course it is also possible that the data you have shown is incomplete and/or a fabricated example, in which case you can ignore this paragraph :)

---- EDIT IN RESPONSE TO EDITED QUESTION ----

The model that you are tying to fit uses a linear effect of temperature, but your data suggests that the effect is not linear (on the logit scale). If you have only a linear effect of temperature then an additional parameter (over dispersion) is needed to suck up the extra unexplained variation in the response, but you may be able to do a better job with a more appropriate effect of temperature. Try the following code for inspiration.

Your data, and a new data frame to use only for visualising predictions:

df <- read.table(header=TRUE, file=textConnection('Temperature Alive   Dead
28  10  0
28  10  0
28  9   1
28  10  0
30  10  0
30  10  0
30  10  0
30  10  0
32  9   1
32  9   1
32  9   1
32  10  0
34  0   10
34  0   10
34  0   10
34  2   8
36  0   10
36  0   10
36  0   10
36  0   10'))

df$Response <- cbind(df$Alive, df$Dead)
	df$Proportion <- df$Alive / (df$Alive + df$Dead)
	df$Replicate <- 1:4

newdata <- data.frame(Temperature=seq(28,36,length=1000))

The model you are using assumes a linear (on the logit scale) effect of temperature, but the plot of the datapoints suggests a more drastic change between 32 and 34 degrees than is consistent with a linear change:

model1 <- glm(Response ~ Temperature, family=binomial, data=df)
extractAIC(model1)
plot(df$Temperature, df$Proportion, pch=df$Replicate)
	lines(newdata$Temperature, predict(model1, type='response', newdata=newdata), type='l')

A simple threshold effect of 33 degrees gives a better prediction:

model2 <- glm(Response ~ I(Temperature > 33), family=binomial, data=df)
extractAIC(model2)
plot(df$Temperature, df$Proportion, pch=df$Replicate)
	lines(newdata$Temperature, predict(model2, type='response', newdata=newdata), type='l')

An alternative is to use a polynomial expansion to explain a curve with (almost) arbitrary shape - the highest order we can use with your data is 4 but this seems to give the best fit:

model3 <- glm(Response ~ poly(Temperature, 4), family=binomial, data=df)
extractAIC(model3)
plot(df$Temperature, df$Proportion, pch=df$Replicate)
	lines(newdata$Temperature, predict(model3, type='response', newdata=newdata), type='l')

I haven't checked, but I suspect that your test for over dispersion would indicate no problems with either models 2 or 3. The obvious problem with model 2 is that we have chosen the threshold based on the data, so this doesn't help you find the threshold itself using the model. For that reason I'd probably use something more like model 3.

The glm dispersion

the glm dispersion is an expression based on the normalized residuals.

So say you have a model fittedModelName Then you get the same for

manual calculation

sum( residuals(fittedModelName, "pearson")^2 ) / fittedModelName$df.residual

automatic
```
summary(fittedModelName)$dispersion
```

note that you can define the quasibinomial model in two ways. You could use:

scaled from 0 to 1
```
formula = Rcnt/total ~ LANG
```

scaled from 0 to n

formula = cbind(Rcnt, total - Rcnt) ~ LANG

which gives different results for the calculated dispersion

The beta-binomial regression by aod's betabin

In the betabin case the reported dispersion is a model parameter. This is explained in the documentation of the function.

The function uses the parameterization .... $\varphi = 1 / (a1 + a2 + 1)$ ... and $\varphi$ is the overdispersion parameter.

You can test this also with the code below:

library(aod)

#generate data

set.seed(1)
alpha = 1
beta = 1
n = 40
x <- rbeta(1000, alpha, beta)
y <- qbinom(runif(1000), n, x)

#modeling

mb <- betabin(cbind(y,n-y)~1, random=~1, data=as.data.frame(list(y=y)))

which shows that the used dispersion is the inverse of the sum $1+\alpha+\beta$ with $\alpha$ and $\beta$ the coefficients of the beta-distribution.

#comparison of dispersion

> mb@param[2]   
phi.(Intercept) 
       0.340159 
> 1/(alpha+beta+1)
[1] 0.3333333

References

The documentation is here:

https://www.rdocumentation.org/packages/aod/versions/1.3/topics/betabin

but I prefer to get these documentation files by typing the function name into the console betabin or ??betabin

Solved – Is an overdispersion parameter of 5.17 for GLMM with Beta family too high to yield reliable results

Using the word "overdispersion parameter" is maybe not an ideal choice by the glmmTMB developers - "dispersion parameter" may be more appropriate.

Overdispersion is when your residual variance is larger than what your fitted model assumes. It mainly only occurs in models that have a fixed dispersion (e.g. Poisson)

When you fit a model with variable dispersion (as you do here), your model adjusts the expected residual variance (=dispersion) during the fit. The dispersion parameter gives your feedback about this adjustment, but relative to this fitted dispersion, the model isn't overdispersed.

A dispersion test for a model with a dispersion parameter > 1 should therefore not indicate overdispersion (at least assuming that the residuals scatter with the estimated dispersion). See also https://github.com/florianhartig/DHARMa/issues/143. Also, There is no reason to put less trust in a model with a large estimated dispersion parameter.

Btw., DHARMa::testOverdispersion should work for the beta family. I have just tried this out, and we see exactly the behavior I describe above: estimated dispersion is around 4, but the dispersion test ist negative

x = runif(100,-1,1)
y = plogis(x + rnorm(100))

library(glmmTMB)
library(DHARMa)

fit <- glmmTMB(y~x, family = beta_family())

summary(fit)
res <-simulateResiduals(fit, plot = T)

Best Answer

Related Solutions

Solved – Types of dispersion parameter for binomial data

The glm dispersion

The beta-binomial regression by aod's betabin

References

Solved – Is an overdispersion parameter of 5.17 for GLMM with Beta family too high to yield reliable results

Related Question