Solved – Interpretation and validation of glmmTMB for ecological count data

count-datageneralized linear modelglmmtmbnegative-binomial-distributionspatio-temporal

I have a couple of questions regarding the analysis of count data using glmms, specifically glmmTMB. I have a four year data set (starting Mar2013 – Dec2016) of insect count data, collected at weekly intervals (allowing for some missing weeks due to bad weather etc..). The same ten transects (or plots) were sampled each week – as well as the insect counts, various environmental data were also recorded at time of survey (for each transect).

I want to investigate the impact of various environmental variables on the presence/absence and abundance of the various life-stages of the insect.
There are approximately 1811 data points, six predictor variables and 2 variables treated as random effects (date and transect).
An example of the dataset and an abundance model:

  head(mydata)
# A tibble: 6 x 19
  year season month   week    day       date tsurvey precip cloudcover transect vegheight       veg1        veg2
<fctr>  <chr> <chr> <fctr> <fctr>     <fctr>   <dbl>  <dbl>      <chr>   <fctr>     <dbl>      <chr>       <chr>
1   2013 spring   Mar     12     79 20/03/2013     7.6     18    partial       T1         6      grass  leaflitter 
2   2013 spring   Mar     12     79 20/03/2013     8.8     18    partial       T2         9      grass  leaflitter
3   2013 spring   Mar     12     79 20/03/2013     9.0     18    partial       T3         8      grass deadbracken
4   2013 spring   Mar     12     79 20/03/2013     8.4     18    partial       T4         8      grass  leaflitter
5   2013 spring   Mar     12     79 20/03/2013     7.7     18    partial       T5         1 leaflitter       grass
6   2013 spring   Mar     12     79 20/03/2013     7.5     18    partial       T6         5      grass  leaflitter
7   2013 spring   Mar     12     79 20/03/2013     7.7     18    partial       T7         5      grass  leaflitter
8   2013 spring   Mar     12     79 20/03/2013     7.7     18    partial       T8         5      grass  leaflitter
9   2013 spring   Mar     12     79 20/03/2013     7.5     18    partial       T9        10      grass  bramble
10   2013 spring   Mar     12     79 20/03/2013     7.5     18    partial       T10         5      grass  leaflitter
11   2013 spring   Mar     13     86 27/03/2013     5.8     5    partial       T1         22      grass  leaflitter

  m3TMB<-glmmTMB(nymphs ~ tsurvey + precip + vegheight + cloudcover +veg1 + 
 (1|transect) + (1|date), family = nbinom2, data = mydata) 
> summary(m3TMB)           
 Family: nbinom2  ( log )
Formula:          nymphs ~ tsurvey + precip + vegheight + cloudcover + veg1 + (1|transect) + (1|date)
Data: mydata
 AIC      BIC   logLik deviance df.resid 
3958     4052    -1962     3924     1771 
Random effects:
Conditional model:
 Groups   Name        Variance Std.Dev.
 date     (Intercept) 1.841    1.357
 transect (Intercept) 0.128    0.358
Number of obs: 1788, groups:  date, 181; transect, 10
Overdispersion parameter for nbinom2 family ():  1.2 
Conditional model:
Estimate Std. Error z value Pr(>|z|) 
(Intercept)       -1.09e+00   1.23e+00   -0.89    0.375 
tsurvey            7.16e-02   1.65e-02    4.35  1.4e-05 ***   
precip            -1.86e-02   7.37e-03   -2.53    0.011 * 
vegheight         -3.57e-02   5.17e-03   -6.91  4.9e-12 ***
cloudcovercloudy   3.43e-01   2.78e-01    1.23    0.219   
cloudcoverpartial  4.64e-01   3.01e-01    1.54    0.123
cloudcoverunknown  3.00e-01   6.49e-01    0.46    0.643
veg1bracken       -7.33e-01   1.38e+00   -0.53    0.596 
veg1bramble       -2.31e-01   1.18e+00   -0.20    0.845 
veg1grass         -1.54e-01   1.18e+00   -0.13    0.896  
veg1leaf          -1.76e+00   1.26e+00   -1.40    0.162   
veg1leaflitter    -9.64e-01   1.19e+00   -0.81    0.416  
veg1Leaft litter  -1.35e+01   2.16e+03   -0.01    0.995    
veg1nettle        -1.29e-01   1.24e+00   -0.10    0.917    
---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1   

Some issues with the data: it is over-dispersed; the response variable has a mean of 1 and variance of 7. There is a strong seasonal element, I am not really sure how /if this needs to be dealt with in a specific manor? There are a lot of zeros in the response variable but I think these are genuine zero counts and so not zero inflated (I hope I am understanding this correctly).

Initially I looked at the final count for each weekly survey i.e. analysis of the total count across all ten transects, but I realised that I was losing a lot of information by doing this ( e.g. the temperature, vegetation and insect count varied in each transect). I decided it would be best to use the data from each transect (T1 to T10). I investigated various methods – glmm seemed best, since there was temporal and spatial pseudoreplication. I did also look into GAMMs, which I thought were perhaps the way forward but found various tutorials/posts that advised extreme caution in their interpretation, especially of their p-values and CI’s. So I turned my attention back to GLMMS – I tried various packages, focusing in on glmmADMB, glmmmgcv and finally; glmmTMB. The latter seemed like it would handle the random effects of date and transect well.

I am not a statistician (surprise) and still learning a lot when it comes to R, stats and modelling – I am aware of the pontential misinterpretation of model outputs (as described in Bolker et al. 2009) and although I understand the limitation of models, I would like to avoid the pit falls of misinterpretation (as much as that is possible!).

Having come this far, my main issues are the interpretation of the output and the validation of the model. Where do I go from here with the validation of this model…if it does indeed make sense for my data? I’ve found some worked examples online but none seem to go through all of the steps to the actual reporting stage – this would be extremely helpful to those learning.

I tried using the DARMHa package for residual diagnostics but it doesn’t seem to support this class of model.

I plotted the residuals (see below) but I am unsure if this the correct way to interpret residuals from a glmmTMB :

m_3sresid <- resid(m3TMB, "pearson")
plot(m_3sresid)

I feel like I've been going in circles for a long time and would appreciate any suggestions from those more experienced or pointed in the direction of some complete worked examples, specially relating to glmmTMB model validation. I’ve gotten a lot of tips on data analysis from Cross Validated in the past but this is my first time asking a question – please let me know if further clarification is required.

Best Answer

As far as I know, there is no way around DHARMa (disclaimer: I maintain the package) or a similar simulation-based approach to achieve a general and robust residual check for this type of GLMMs. See discussion in the DHARMa vignette why Pearson residuals are not a good idea.

So far, I had not included glmmTMB in DHARMa because of the lack of a simulate function, but I just checked and realized that they have implemented such a function, so I should be able to add glmmTMB support in DHARMa in the next days. See updates about this and an interim solution here.

Edit 30/03/18: a test version is now available, see install instructions here. There are still a few smaller compatibility issues, but I would say it's usable for research, with due caution - I will mix these changes into the CRAN version of DHARMa as soon as all issues are fixed. Until this is mixed into CRAN, install the test version as described in the issue linked above.

Related Question