Solved – Dealing with Overdispersed Negative Binomial using glmmTMB

ecologyglmmtmbmixed modelnegative-binomial-distributionoverdispersion

I'm new to the world of statistical modeling, but I was wondering if anyone had any input on how to handle over dispersed negative binomial data? I'm working on modeling bat activity as a response variable against a variety of insect, vegetation, and environmental variables. My objective is to see which explanatory variables (whether it be insect, vegetation, and/or environmental) are impacting bat activity the most.

My response variable is bat activity (count data) with an offset for # of survey nights the acoustic detectors ran for and is seemingly quite overdispersed. I've run Poisson models, all with the conclusion that they are overdispersed, so I've moved onto NB2 models using glmmTMB package; all predictor variables are scaled and centered. Below is the str of a few explanatory variables:

$ Year            :  Factor w/ 2 levels "2017", "2018": 1 1 1 1 1 1 1 1 1 1 1
 $ Habitat         : Factor w/ 4 levels "MCF","MM","MMF",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ Site            : Factor w/ 63 levels "MCF_001","MCF_002",..: 1 2 3 4 5 6 8 9 17 19 ...
 $ Bats            : int  4 1 47 61 5 14 7 84 6 3 ...
 $ Mylu            : int  3 0 38 13 0 1 0 6 4 0 ...
 $ Myse            : int  0 0 3 5 3 3 0 16 0 0 ...
 $ Survey.Nights   : int  4 5 6 4 4 4 5 4 4 5 ...
 $ Avg.Biomass     : num  -0.381 -0.481 0.908 -0.574 0.943 ...
 $ Shannon.Weaver  : num  -0.6412 0.0586 -0.2082 0.7039 0.7002 ...
 $ Num.Orders      : num  0.0711 -1.8912 0.0711 -1.8912 1.0522 ...
 $ Avg.Snags       : num  -0.851 1.837 0.224 0.493 -0.851 ...
 $ Avg.Understory  : num  -0.00711 -0.94428 3.51112 3.58282 0.55621 ...
 $ Avg.Midstory    : num  -0.35 0.255 -0.461 -0.589 -0.295 ...
 $ Avg.Canopy      : num  -1.056 0.692 1.129 1.129 0.911 ...
 $ Avg.Canopy.Cover: num  -0.822 0.514 1.182 0.982 1.182 ...
 $ Perc.Dec.Dom    : num  -0.491 -1.091 -1.942 -1.546 0.61 ...
 $ Avg.Bat.Date    : num  -0.7704 -0.9971 -0.2208 -0.2208 -0.0834 ...
 $ Avg.Bat.Night.Hr: num  -0.843 -0.951 -0.407 -0.429 -0.299 ...
 $ Avg.Bat.Temp    : num  0.5214 -0.5578 -1.0893 -0.2349 -0.0632 ...
 $ Bat.Dist.Edge   : num  -0.879 -0.432 -0.179 1.544 0.616 ...
 $ Bat.Elevation   : num  -0.741 -0.575 -0.12 -0.171 0.356 ...
 $ Bat.Moon        : num  0.667 -0.279 0.794 0.857 0.352 ...
nbin <- glmmTMB(Bats ~ Avg.Biomass + Num.Orders + Avg.Understory + Avg.Midstory + 
    Avg.Canopy.Cover + Perc.Dec.Dom + Avg.Snags + Avg.Bat.Date + Avg.Bat.Temp +
    Bat.Elevation + Bat.Moon + Bat.Water.Feat + Avg.Biomass + Num.Orders + 
    Avg.Bat.Temp*Avg.Bat.Date + Avg.Biomass*Year + Year + Habitat + 
    offset(log(Survey.Nights)) + (1|Site), 
    data = insect.data, 
    ziformula = ~0, 
    family = nbinom2)

summary(nbin)
 
Family: nbinom2  ( log )
Formula:          Bats ~ Avg.Biomass + Num.Orders + Avg.Understory + Avg.Midstory +  
    Avg.Canopy.Cover + Perc.Dec.Dom + Avg.Snags + Avg.Bat.Date + 
    Avg.Bat.Temp + Bat.Elevation + Bat.Moon + Bat.Water.Feat +
    Avg.Biomass + Num.Orders + Avg.Bat.Temp * Avg.Bat.Date +  
    Avg.Biomass * Year + Year + Habitat + offset(log(Survey.Nights)) +      
(1 | Site)
Data: insect.data

     AIC      BIC   logLik deviance df.resid 
     539      588     -247      495       47 

Random effects:

Conditional model:
 Groups Name        Variance Std.Dev.
 Site   (Intercept) 2.44e-09 4.94e-05
Number of obs: 69, groups:  Site, 36

Overdispersion parameter for nbinom2 family (): 2.47 

Conditional model:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  0.526      0.572    0.92  0.35763    
Avg.Biomass                 -1.866      0.390   -4.78  1.7e-06 ***
Num.Orders                   0.876      0.136    6.44  1.2e-10 ***
Avg.Understory               0.431      0.120    3.58  0.00034 ***
Avg.Midstory                -2.148      0.319   -6.72  1.8e-11 ***
Avg.Canopy.Cover             0.465      0.190    2.45  0.01420 *  
Perc.Dec.Dom                 0.498      0.181    2.74  0.00606 ** 
Avg.Snags                    0.694      0.142    4.88  1.1e-06 ***
Avg.Bat.Date                 0.110      0.169    0.65  0.51553    
Avg.Bat.Temp                -0.197      0.205   -0.96  0.33524    
Bat.Elevation               -0.360      0.126   -2.86  0.00429 ** 
Bat.Moon                     0.541      0.111    4.85  1.2e-06 ***
Bat.Water.FeatRiver         -0.315      0.559   -0.56  0.57312    
Bat.Water.FeatStream         7.018      1.330    5.28  1.3e-07 ***
Year2018                     0.169      0.312    0.54  0.58789    
HabitatMM                    0.185      0.383    0.48  0.62982    
HabitatMMF                   0.146      0.348    0.42  0.67448    
HabitatREGEN                 1.121      0.356    3.15  0.00164 ** 
Avg.Bat.Date:Avg.Bat.Temp   -0.392      0.196   -2.00  0.04514 *  
Avg.Biomass:Year2018         1.500      0.375    4.00  6.2e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

enter image description here

res <- simulateResiduals(nbin)
plot(res,rank = T)

testResiduals(res)

data:  simulationOutput
ratioObsSim = 0.7, p-value = 0.4
alternative hypothesis: two.sided

> testResiduals(res)
$uniformity

    One-sample Kolmogorov-Smirnov test

data:  simulationOutput$scaledResiduals
D = 0.05, p-value = 1
alternative hypothesis: two-sided


$dispersion

    DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated

data:  simulationOutput
ratioObsSim = 0.7, p-value = 0.4
alternative hypothesis: two.sided


$outliers

    DHARMa outlier test based on exact binomial test

data:  simulationOutput
outLow = 0e+00, outHigh = 1e+00, nobs = 7e+01, freqH0 = 4e-03, p-value = 0.5
alternative hypothesis: two.sided


$uniformity

    One-sample Kolmogorov-Smirnov test

data:  simulationOutput$scaledResiduals
D = 0.05, p-value = 1
alternative hypothesis: two-sided


$dispersion

    DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated

data:  simulationOutput
ratioObsSim = 0.7, p-value = 0.4
alternative hypothesis: two.sided


$outliers

    DHARMa outlier test based on exact binomial test

data:  simulationOutput
outLow = 0e+00, outHigh = 1e+00, nobs = 7e+01, freqH0 = 4e-03, p-value = 0.5
alternative hypothesis: two.sided][1]][1]

Then, I wanted to manually check dispersion and here's where I ran into some concerns

m1 <- nbin
dispfun <- function(m) {
        r <- residuals(m,type="pearson")
        n <- df.residual(m)
        dsq <- sum(r^2)
        c(dsq=dsq,n=n,disp=dsq/n)
}
options(digits=2)
dispfun(m1)

dsq    n disp 
76.1 47.0  1.6

This seems to indicate overdispersion in my model, however, I've already added covariates (as you can see, my model is quite complex and this is after dropping non-significant factors), and adding interactions (Hilbe 2011 suggestions). However, the DHARMa residuals look fairly decent. Which should I trust? Does anyone have any suggestions on how to handle this?

I reran with GLMMadaptive and got the following output with a different dispersion parameter:

Call:
mixed_model(fixed = Bats ~ Avg.Biomass + Num.Orders + Avg.Understory + 
    Avg.Midstory + Avg.Canopy.Cover + Perc.Dec.Dom + Avg.Snags + 
    Avg.Bat.Date + Avg.Bat.Temp + Bat.Elevation + Bat.Moon + 
    Bat.Water.Feat + Avg.Biomass + Num.Orders + Avg.Bat.Temp * 
    Avg.Bat.Date + Avg.Biomass * Yr + Num.Orders * Yr + Avg.Bat.Date * 
    Bat.Moon + Yr + Habitat + offset(log(Survey.Nights)), random = (~1 | 
    Site), data = insect.data2, family = negative.binomial(), 
    iter_EM = 300)

Data Descriptives:
Number of Observations: 67
Number of Groups: 36 

Model:
 family: negative binomial
 link: log 

Fit statistics:
   log.Lik      AIC      BIC
 -230.2856 508.5711 546.5756

Random effects covariance matrix:
               StdDev
(Intercept) 0.0514579

Fixed effects:
                          Estimate Std.Err z-value    p-value
(Intercept)                 0.7447  0.5482  1.3584 0.17434114
Avg.Biomass                -1.5392  0.3861 -3.9871    < 1e-04
Num.Orders                  0.4840  0.1862  2.5987 0.00935661
Avg.Understory              0.2471  0.1299  1.9023 0.05713095
Avg.Midstory               -2.3953  0.3624 -6.6098    < 1e-04
Avg.Canopy.Cover            0.6657  0.1879  3.5422 0.00039685
Perc.Dec.Dom                0.5743  0.1737  3.3059 0.00094668
Avg.Snags                   0.5411  0.1494  3.6217 0.00029270
Avg.Bat.Date               -0.0040  0.1860 -0.0217 0.98266247
Avg.Bat.Temp               -0.7496  0.2795 -2.6818 0.00732270
Bat.Elevation              -0.3307  0.1270 -2.6032 0.00923670
Bat.Moon                    0.5336  0.1206  4.4251    < 1e-04
Bat.Water.FeatRiver        -0.7486  0.5586 -1.3402 0.18017727
Bat.Water.FeatStream        7.1474  1.4996  4.7663    < 1e-04
Yr2018                      0.4797  0.3066  1.5643 0.11774826
HabitatMM                  -0.0861  0.3768 -0.2285 0.81928969
HabitatMMF                 -0.3509  0.3605 -0.9735 0.33030629
HabitatREGEN                1.0362  0.3399  3.0486 0.00229947
Avg.Bat.Date:Avg.Bat.Temp  -0.6803  0.2172 -3.1324 0.00173393
Avg.Biomass:Yr2018          1.1956  0.3758  3.1815 0.00146534
Num.Orders:Yr2018           0.6276  0.2661  2.3584 0.01835350
Avg.Bat.Date:Bat.Moon       0.3587  0.1782  2.0130 0.04411454

log(dispersion) parameter:
  Estimate Std.Err
    1.0421  0.2256

Integration:
method: adaptive Gauss-Hermite quadrature rule
quadrature points: 11

Optimization:
method: hybrid EM and quasi-Newton
converged: TRUE 

My main objective is to see which explanatory variables could be driving bat activity in my study area and whether bat activity can be predicted based off of insect activity, different vegetative characteristics, and/or environmental factors.

Best Answer

A couple of points:

  • The variance of the random effect for site is extremely low. This could either mean that there is no correlations in the bat activity within a site or that could be an artefact of the Laplace approximation used behind glmmTMB() to approximate the integrals of the random effects. You could also try fitting the same model with the GLMMadaptive package that approximates the same integrals with the adaptive Gaussian quadrature procedure that can be more accurate. You can find examples here and here.
  • It would be better to check the fit of the model, and possible remaining over-dispersion using the scaled simulated residuals of the DHARMa package. An example using this package to check the fit of a negative binomial model can be found here.
  • It would be better to define the variables as factors beforehand and not inside the formula. Moreover, are you sure that you need all these interaction terms?
Related Question