Solved – Trouble finding good model fit for count data with mixed effects – ZINB or something else

count-datalme4-nlmemixed modelnegative-binomial-distributionzero inflation

I have a very small data set on solitary bee abundance that I am having trouble analysing. It’s count data, and almost all the counts are in one treatment with most of the zeroes in the other treatment. There are also a couple of very high values (one each at two of the six sites), so the distribution of the counts has an extremely long tail. I’m working in R. I have used two different packages: lme4 and glmmADMB.

Poisson mixed models didn’t fit: models were very overdispersed when random effects were not fitted (glm model), and underdispersed when random effects were fitted (glmer model). I don't understand why this is. The experimental design calls for nested random effects so I need to include them.
A Poisson lognormal error distribution did not improve the fit. I tried negative binomial error distribution using glmer.nb and couldn’t get it to fit – iteration limit reached, even when changed the tolerance using glmerControl(tolPwrss=1e-3).

Because a lot of the zeroes will be due to the fact that I simply didn’t see the bees (they are often tiny black things), I next tried a zero-inflated model. The ZIP didn’t fit well. The ZINB was the best model fit so far, but I am still not too happy with the model fit. I am at a loss as to what to try next. I did try a hurdle model but couldn’t fit a truncated distribution to the non-zero outcomes– I think because so many of the zeroes are in the control treatment (the error message was “Error in model.frame.default(formula = s.bee ~ tmt + lu + : variable lengths differ (found for 'treatment')”).

In addition, I think that the interaction I have included is doing something strange to my data as the coefficients are unrealistically small – although the model containing the interaction was best when I compared models using AICctab in package bbmle.

I am including some R script that will pretty much reproduce my data set.
Variables are as follows:

d=Julian date,
df=Julian date (as factor) ,
d.sq=df squared (number of bees increases then falls throughout the summer),
st=site,
s.bee=count of bees,
tmt=treatment,
lu=type of land use,
hab=percentage of semi natural habitat in surrounding landscape,
ba=boundary area round fields.

Any suggestions as to how I can obtain a good model fit (alternative error distributions, different types of model etc) would be very gratefully received!

Thank you.

d <- c(80,  80,  121, 121, 180, 180, 86,  86,  116, 116, 144, 144, 74,  74, 143, 143, 163, 163, 71, 71,106, 106, 135, 135, 162, 162, 185, 185, 83,  83,  111, 111, 133, 133, 175, 175, 85,  85,  112, 112,137, 137, 168, 168, 186, 186, 64,  64,  95,  95,  127, 127, 156, 156, 175, 175, 91,  91, 119, 119,120, 120, 148, 148, 56, 56)
df <- as.factor(d)
d.sq <- d^2
st <- factor(rep(c("A", "B", "C", "D", "E", "F"), c(6,12,18,10,14,6)))
s.bee <- c(1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,4,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,3,0,0,0,0,5,0,0,2,0,50,0,10,0,4,0,47,3)
tmt <- factor(c("AF","C","C","AF","AF","C","AF","C","AF","C","C","AF","AF","C","AF","C","AF","C","AF","C",
"C","AF","AF","C","AF","C","C","AF","AF","C","AF","C","AF","C","AF","C","AF","C","AF","C",
"C","AF","AF","C","AF","C","AF","C","AF","C","C","AF","C","AF","C","AF","AF","C","AF","C",
"AF","C","AF","C","AF","C"))
lu <- factor(rep(c("p","a","p","a","p"), c(6,12,28,14,6)))
hab <- rep(c(13,14,13,14,3,4,3,4,3,4,3,4,3,4,15,35,37,35,37,35,37,35,37,0,2,1,2,1,2,1), 
        c(1,2,2,1,1,1,1,2,2,1,1,1,1,1,18,1,1,1,2,2,1,1,1,14,1,1,1,1,1,1))
ba <-  c(480,6520,6520,480,480,6520,855,1603,855,1603,1603,855,855,12526,855,5100,855,5100,2670,7679,7679,2670,
2670,7679,2670,7679,7679,2670,2670,7679,2670,7679,2670,7679,2670,7679,1595,3000,1595,3000,3000,1595,1595,3000,1595
,3000,4860,5460,4860,5460,5460,4860,5460,4860,5460,4860,4840,5460,4840,5460,3000,1410,3000,1410,3000,1410)
data <- data.frame(st,df,d.sq,tmt,lu,hab,ba,s.bee)
with(data, table(s.bee, tmt) )

# below is a much abbreviated summary of attempted models:

library(MASS)
library(lme4)
library(glmmADMB)
library(coefplot2)

###
### POISSON MIXED MODEL

    m1 <- glmer(s.bee ~ tmt + lu + hab + (1|st/df), family=poisson)
    summary(m1)

    resdev<-sum(resid(m1)^2)
    mdf<-length(fixef(m1)) 
    rdf<-nrow(data)-mdf
    resdev/rdf
# 0.2439303
# underdispersed. ???



###
### NEGATIVE BINOMIAL MIXED MODEL

    m2 <- glmer.nb(s.bee ~ tmt + lu + hab + d.sq + (1|st/df))
# iteration limit reached. Can't make a model work.



###
### ZERO-INFLATED POISSON MIXED MODEL

    fit_zipoiss <- glmmadmb(s.bee~tmt + lu + hab + ba + d.sq +
                    tmt:lu +
                    (1|st/df), data=data,
                    zeroInflation=TRUE,
                    family="poisson")
# has to have lots of variables to fit
# anyway Poisson is not a good fit



###
### ZERO-INFLATED NEGATIVE BINOMIAL MIXED MODELS

## BEST FITTING MODEL SO FAR:

    fit_zinb <- glmmadmb(s.bee~tmt + lu + hab +
                    tmt:lu +
                    (1|st/df),data=data,
                    zeroInflation=TRUE,
                    family="nbinom")
    summary(fit_zinb)
# coefficients are tiny, something odd going on with the interaction term
# but this was best model in AICctab comparison

# model check plots
    qqnorm(resid(fit_zinb))
    qqline(resid(fit_zinb))

    coefplot2(fit_zinb)

    resid_zinb <- resid(fit_zinb , type = "pearson")
    hist(resid_zinb)

    fitted_zinb <- fitted (fit_zinb)
    plot(resid_zinb ~ fitted_zinb)


## MODEL WITHOUT INTERACTION TERM - the coefficients are more realistic:

    fit_zinb2 <- glmmadmb(s.bee~tmt + lu + hab +
                    (1|st/df),data=data,
                    zeroInflation=TRUE,
                    family="nbinom")

# model check plots
    qqnorm(resid(fit_zinb2))
    qqline(resid(fit_zinb2))

    coefplot2(fit_zinb2)

    resid_zinb2 <- resid(fit_zinb2 , type = "pearson")
    hist(resid_zinb2)

    fitted_zinb2 <- fitted (fit_zinb2)
    plot(resid_zinb2 ~ fitted_zinb2)



# ZINB models are best so far
# but I'm not happy with the model check plots

Best Answer

This post has four years, but I wanted to follow on what fsociety said in a comment. Diagnosis of residuals in GLMMs is not straightforward, since standard residual plots can show non-normality, heteroscedasticity, etc., even if the model is correctly specified. There is an R package, DHARMa, specifically suited for diagnosing these type of models.

The package is based on a simulation approach to generate scaled residuals from fitted generalized linear mixed models and generates different easily interpretable diagnostic plots. Here is a small example with the data from the original post and the first fitted model (m1):

library(DHARMa)
sim_residuals <- simulateResiduals(m1, 1000)
plotSimulatedResiduals(sim_residuals)

DHARMa residuals plot

The plot on the left shows a QQ plot of the scaled residuals to detect deviations from the expected distribution, and the plot on the right represents residuals vs predicted values while performing quantile regression to detect deviations from uniformity (red lines should be horizontal and at 0.25, 0.50 and 0.75).

Additionally, the package has also specific functions for testing for over/under dispersion and zero inflation, among others:

testOverdispersionParametric(m1)

Chisq test for overdispersion in GLMMs

data:  poisson
dispersion = 0.18926, pearSS = 11.35600, rdf = 60.00000, p-value = 1
alternative hypothesis: true dispersion greater 1

testZeroInflation(sim_residuals)

DHARMa zero-inflation test via comparison to expected zeros with 
simulation under H0 = fitted model


data:  sim_residuals
ratioObsExp = 0.98894, p-value = 0.502
alternative hypothesis: more
Related Question