Solved – Dealing with Overdispersed Negative Binomial using glmmTMB

ecologyglmmtmbmixed modelnegative-binomial-distributionoverdispersion

I'm new to the world of statistical modeling, but I was wondering if anyone had any input on how to handle over dispersed negative binomial data? I'm working on modeling bat activity as a response variable against a variety of insect, vegetation, and environmental variables. My objective is to see which explanatory variables (whether it be insect, vegetation, and/or environmental) are impacting bat activity the most.

My response variable is bat activity (count data) with an offset for # of survey nights the acoustic detectors ran for and is seemingly quite overdispersed. I've run Poisson models, all with the conclusion that they are overdispersed, so I've moved onto NB2 models using glmmTMB package; all predictor variables are scaled and centered. Below is the str of a few explanatory variables:

$ Year            :  Factor w/ 2 levels "2017", "2018": 1 1 1 1 1 1 1 1 1 1 1
 $ Habitat         : Factor w/ 4 levels "MCF","MM","MMF",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ Site            : Factor w/ 63 levels "MCF_001","MCF_002",..: 1 2 3 4 5 6 8 9 17 19 ...
 $ Bats            : int  4 1 47 61 5 14 7 84 6 3 ...
 $ Mylu            : int  3 0 38 13 0 1 0 6 4 0 ...
 $ Myse            : int  0 0 3 5 3 3 0 16 0 0 ...
 $ Survey.Nights   : int  4 5 6 4 4 4 5 4 4 5 ...
 $ Avg.Biomass     : num  -0.381 -0.481 0.908 -0.574 0.943 ...
 $ Shannon.Weaver  : num  -0.6412 0.0586 -0.2082 0.7039 0.7002 ...
 $ Num.Orders      : num  0.0711 -1.8912 0.0711 -1.8912 1.0522 ...
 $ Avg.Snags       : num  -0.851 1.837 0.224 0.493 -0.851 ...
 $ Avg.Understory  : num  -0.00711 -0.94428 3.51112 3.58282 0.55621 ...
 $ Avg.Midstory    : num  -0.35 0.255 -0.461 -0.589 -0.295 ...
 $ Avg.Canopy      : num  -1.056 0.692 1.129 1.129 0.911 ...
 $ Avg.Canopy.Cover: num  -0.822 0.514 1.182 0.982 1.182 ...
 $ Perc.Dec.Dom    : num  -0.491 -1.091 -1.942 -1.546 0.61 ...
 $ Avg.Bat.Date    : num  -0.7704 -0.9971 -0.2208 -0.2208 -0.0834 ...
 $ Avg.Bat.Night.Hr: num  -0.843 -0.951 -0.407 -0.429 -0.299 ...
 $ Avg.Bat.Temp    : num  0.5214 -0.5578 -1.0893 -0.2349 -0.0632 ...
 $ Bat.Dist.Edge   : num  -0.879 -0.432 -0.179 1.544 0.616 ...
 $ Bat.Elevation   : num  -0.741 -0.575 -0.12 -0.171 0.356 ...
 $ Bat.Moon        : num  0.667 -0.279 0.794 0.857 0.352 ...

nbin <- glmmTMB(Bats ~ Avg.Biomass + Num.Orders + Avg.Understory + Avg.Midstory + 
    Avg.Canopy.Cover + Perc.Dec.Dom + Avg.Snags + Avg.Bat.Date + Avg.Bat.Temp +
    Bat.Elevation + Bat.Moon + Bat.Water.Feat + Avg.Biomass + Num.Orders + 
    Avg.Bat.Temp*Avg.Bat.Date + Avg.Biomass*Year + Year + Habitat + 
    offset(log(Survey.Nights)) + (1|Site), 
    data = insect.data, 
    ziformula = ~0, 
    family = nbinom2)

summary(nbin)
 
Family: nbinom2  ( log )
Formula:          Bats ~ Avg.Biomass + Num.Orders + Avg.Understory + Avg.Midstory +  
    Avg.Canopy.Cover + Perc.Dec.Dom + Avg.Snags + Avg.Bat.Date + 
    Avg.Bat.Temp + Bat.Elevation + Bat.Moon + Bat.Water.Feat +
    Avg.Biomass + Num.Orders + Avg.Bat.Temp * Avg.Bat.Date +  
    Avg.Biomass * Year + Year + Habitat + offset(log(Survey.Nights)) +      
(1 | Site)
Data: insect.data

     AIC      BIC   logLik deviance df.resid 
     539      588     -247      495       47 

Random effects:

Conditional model:
 Groups Name        Variance Std.Dev.
 Site   (Intercept) 2.44e-09 4.94e-05
Number of obs: 69, groups:  Site, 36

Overdispersion parameter for nbinom2 family (): 2.47 

Conditional model:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  0.526      0.572    0.92  0.35763    
Avg.Biomass                 -1.866      0.390   -4.78  1.7e-06 ***
Num.Orders                   0.876      0.136    6.44  1.2e-10 ***
Avg.Understory               0.431      0.120    3.58  0.00034 ***
Avg.Midstory                -2.148      0.319   -6.72  1.8e-11 ***
Avg.Canopy.Cover             0.465      0.190    2.45  0.01420 *  
Perc.Dec.Dom                 0.498      0.181    2.74  0.00606 ** 
Avg.Snags                    0.694      0.142    4.88  1.1e-06 ***
Avg.Bat.Date                 0.110      0.169    0.65  0.51553    
Avg.Bat.Temp                -0.197      0.205   -0.96  0.33524    
Bat.Elevation               -0.360      0.126   -2.86  0.00429 ** 
Bat.Moon                     0.541      0.111    4.85  1.2e-06 ***
Bat.Water.FeatRiver         -0.315      0.559   -0.56  0.57312    
Bat.Water.FeatStream         7.018      1.330    5.28  1.3e-07 ***
Year2018                     0.169      0.312    0.54  0.58789    
HabitatMM                    0.185      0.383    0.48  0.62982    
HabitatMMF                   0.146      0.348    0.42  0.67448    
HabitatREGEN                 1.121      0.356    3.15  0.00164 ** 
Avg.Bat.Date:Avg.Bat.Temp   -0.392      0.196   -2.00  0.04514 *  
Avg.Biomass:Year2018         1.500      0.375    4.00  6.2e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

res <- simulateResiduals(nbin)
plot(res,rank = T)

testResiduals(res)

data:  simulationOutput
ratioObsSim = 0.7, p-value = 0.4
alternative hypothesis: two.sided

> testResiduals(res)
$uniformity

    One-sample Kolmogorov-Smirnov test

data:  simulationOutput$scaledResiduals
D = 0.05, p-value = 1
alternative hypothesis: two-sided


$dispersion

    DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated

data:  simulationOutput
ratioObsSim = 0.7, p-value = 0.4
alternative hypothesis: two.sided


$outliers

    DHARMa outlier test based on exact binomial test

data:  simulationOutput
outLow = 0e+00, outHigh = 1e+00, nobs = 7e+01, freqH0 = 4e-03, p-value = 0.5
alternative hypothesis: two.sided


$uniformity

    One-sample Kolmogorov-Smirnov test

data:  simulationOutput$scaledResiduals
D = 0.05, p-value = 1
alternative hypothesis: two-sided


$dispersion

    DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated

data:  simulationOutput
ratioObsSim = 0.7, p-value = 0.4
alternative hypothesis: two.sided


$outliers

    DHARMa outlier test based on exact binomial test

data:  simulationOutput
outLow = 0e+00, outHigh = 1e+00, nobs = 7e+01, freqH0 = 4e-03, p-value = 0.5
alternative hypothesis: two.sided][1]][1]

Then, I wanted to manually check dispersion and here's where I ran into some concerns

m1 <- nbin
dispfun <- function(m) {
        r <- residuals(m,type="pearson")
        n <- df.residual(m)
        dsq <- sum(r^2)
        c(dsq=dsq,n=n,disp=dsq/n)
}
options(digits=2)
dispfun(m1)

dsq    n disp 
76.1 47.0  1.6

This seems to indicate overdispersion in my model, however, I've already added covariates (as you can see, my model is quite complex and this is after dropping non-significant factors), and adding interactions (Hilbe 2011 suggestions). However, the DHARMa residuals look fairly decent. Which should I trust? Does anyone have any suggestions on how to handle this?

I reran with GLMMadaptive and got the following output with a different dispersion parameter:

Call:
mixed_model(fixed = Bats ~ Avg.Biomass + Num.Orders + Avg.Understory + 
    Avg.Midstory + Avg.Canopy.Cover + Perc.Dec.Dom + Avg.Snags + 
    Avg.Bat.Date + Avg.Bat.Temp + Bat.Elevation + Bat.Moon + 
    Bat.Water.Feat + Avg.Biomass + Num.Orders + Avg.Bat.Temp * 
    Avg.Bat.Date + Avg.Biomass * Yr + Num.Orders * Yr + Avg.Bat.Date * 
    Bat.Moon + Yr + Habitat + offset(log(Survey.Nights)), random = (~1 | 
    Site), data = insect.data2, family = negative.binomial(), 
    iter_EM = 300)

Data Descriptives:
Number of Observations: 67
Number of Groups: 36 

Model:
 family: negative binomial
 link: log 

Fit statistics:
   log.Lik      AIC      BIC
 -230.2856 508.5711 546.5756

Random effects covariance matrix:
               StdDev
(Intercept) 0.0514579

Fixed effects:
                          Estimate Std.Err z-value    p-value
(Intercept)                 0.7447  0.5482  1.3584 0.17434114
Avg.Biomass                -1.5392  0.3861 -3.9871    < 1e-04
Num.Orders                  0.4840  0.1862  2.5987 0.00935661
Avg.Understory              0.2471  0.1299  1.9023 0.05713095
Avg.Midstory               -2.3953  0.3624 -6.6098    < 1e-04
Avg.Canopy.Cover            0.6657  0.1879  3.5422 0.00039685
Perc.Dec.Dom                0.5743  0.1737  3.3059 0.00094668
Avg.Snags                   0.5411  0.1494  3.6217 0.00029270
Avg.Bat.Date               -0.0040  0.1860 -0.0217 0.98266247
Avg.Bat.Temp               -0.7496  0.2795 -2.6818 0.00732270
Bat.Elevation              -0.3307  0.1270 -2.6032 0.00923670
Bat.Moon                    0.5336  0.1206  4.4251    < 1e-04
Bat.Water.FeatRiver        -0.7486  0.5586 -1.3402 0.18017727
Bat.Water.FeatStream        7.1474  1.4996  4.7663    < 1e-04
Yr2018                      0.4797  0.3066  1.5643 0.11774826
HabitatMM                  -0.0861  0.3768 -0.2285 0.81928969
HabitatMMF                 -0.3509  0.3605 -0.9735 0.33030629
HabitatREGEN                1.0362  0.3399  3.0486 0.00229947
Avg.Bat.Date:Avg.Bat.Temp  -0.6803  0.2172 -3.1324 0.00173393
Avg.Biomass:Yr2018          1.1956  0.3758  3.1815 0.00146534
Num.Orders:Yr2018           0.6276  0.2661  2.3584 0.01835350
Avg.Bat.Date:Bat.Moon       0.3587  0.1782  2.0130 0.04411454

log(dispersion) parameter:
  Estimate Std.Err
    1.0421  0.2256

Integration:
method: adaptive Gauss-Hermite quadrature rule
quadrature points: 11

Optimization:
method: hybrid EM and quasi-Newton
converged: TRUE

My main objective is to see which explanatory variables could be driving bat activity in my study area and whether bat activity can be predicted based off of insect activity, different vegetative characteristics, and/or environmental factors.

Best Answer

A couple of points:

The variance of the random effect for site is extremely low. This could either mean that there is no correlations in the bat activity within a site or that could be an artefact of the Laplace approximation used behind glmmTMB() to approximate the integrals of the random effects. You could also try fitting the same model with the GLMMadaptive package that approximates the same integrals with the adaptive Gaussian quadrature procedure that can be more accurate. You can find examples here and here.
It would be better to check the fit of the model, and possible remaining over-dispersion using the scaled simulated residuals of the DHARMa package. An example using this package to check the fit of a negative binomial model can be found here.
It would be better to define the variables as factors beforehand and not inside the formula. Moreover, are you sure that you need all these interaction terms?

Related Solutions

Solved – Overdispersed poisson or negative binomial regression

Instead of overdispersed (or quasi-)poisson regression you can use the NB1 distribution, which has the same linear variance function as ODP and a full-fledged likelihood function instead of the quasilikelihood of ODP. NB1 is implemented in the gamlss package as family=NBII, whereas regular Negative Binomial can be called through family=NBI. All credit for this part of the answer goes to @Achim Zeileis for helping me with a similar question here: Why is the Quasipoisson in glm not treated as a special case of Negative Binomial? , see his post for more info regarding NB/NB1 (and the confusing naming conventions).

Regarding ANOVA, I have not been able to find a built-in method for gamlss objects, but it is not hard to write your own implementation of the chi-squared test statistic. An example:

set.seed(123)
data = rNBII(100,mu = 6,sigma=0.5)  #generate NB1 data with mean mu=5 and variance (1+sigma)*mu = 9
h0 = gamlss(data~1,family=PO) #null model: poisson
h1 = gamlss(data~1,family=NBII) #alternative model: NB1/ODP

df = h1$df.fit - h0$df.fit
deviance = as.numeric(-2*logLik(h0) + 2*logLik(h1))
p.value = pchisq(deviance,df,lower.tail=F)
> p.value
[1] 0.01429169 #reject the null model at > 95% confidence

Solved – Simulating an overdispersed negative binomial distribution

Since you have the mean and the variance, the easiest way to do this would be to use the parameterization via the mu and size parameters of rnbinom. Quoting from the help page:

An alternative parametrization (often used in ecology) is by the mean ‘mu’ (see above), and ‘size’, the dispersion parameter, where ‘prob’ = size/(size+mu)’. The variance is ‘mu + mu^2/size’ in this parametrization.

So we can use your mean and variance estimates and solve for size:

size=46.28^2/(4493154.99-46.28))

We can simulate like this:

rnbinom(4200,mu=46.28,size=46.28^2/(4493154.99-46.28))

However...

set.seed(1)
table(rnbinom(4200,mu=46.28,size=46.28^2/(4493154.99-46.28)))

    0     1     3     4     7     8     9    10    13    18    19    24    52 
 4174     1     1     1     1     1     1     1     1     1     1     1     1 
   68    69    72   115   186   197  1261  3654  3745  4212  5790 10304 14269 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
25467 
    1

Note especially the right tail. If you simulate this repeatedly, the maximum value and the entire right tail will fluctuate wildly. (I just sampled this five times and got maxima between 14,294 and 186,670.) This is a consequence of (a) enormous overdispersion (your SD is 45 times the mean!), and (b) a large sample size. And of course your estimate of the SD and the overdispersion is largely determined by the right tail of your observations.

Whatever you plan on doing with your data, I'd take a good hard look at it. Maybe perform some sensitivity analysis. What would your result look like if your maximum had been "only" 50,000? And so on.

It may be that whatever conclusions you draw in the course of your analysis are mainly driven by the largest ten of your observations, 0.2% of your sample of size 4200. That is not a very strong position to argue from.

Best Answer

Related Solutions

Solved – Overdispersed poisson or negative binomial regression

Solved – Simulating an overdispersed negative binomial distribution

Related Question