I have a couple of questions regarding the analysis of count data using glmms, specifically glmmTMB. I have a four year data set (starting Mar2013 – Dec2016) of insect count data, collected at weekly intervals (allowing for some missing weeks due to bad weather etc..). The same ten transects (or plots) were sampled each week – as well as the insect counts, various environmental data were also recorded at time of survey (for each transect).
I want to investigate the impact of various environmental variables on the presence/absence and abundance of the various life-stages of the insect.
There are approximately 1811 data points, six predictor variables and 2 variables treated as random effects (date and transect).
An example of the dataset and an abundance model:
head(mydata)
# A tibble: 6 x 19
year season month week day date tsurvey precip cloudcover transect vegheight veg1 veg2
<fctr> <chr> <chr> <fctr> <fctr> <fctr> <dbl> <dbl> <chr> <fctr> <dbl> <chr> <chr>
1 2013 spring Mar 12 79 20/03/2013 7.6 18 partial T1 6 grass leaflitter
2 2013 spring Mar 12 79 20/03/2013 8.8 18 partial T2 9 grass leaflitter
3 2013 spring Mar 12 79 20/03/2013 9.0 18 partial T3 8 grass deadbracken
4 2013 spring Mar 12 79 20/03/2013 8.4 18 partial T4 8 grass leaflitter
5 2013 spring Mar 12 79 20/03/2013 7.7 18 partial T5 1 leaflitter grass
6 2013 spring Mar 12 79 20/03/2013 7.5 18 partial T6 5 grass leaflitter
7 2013 spring Mar 12 79 20/03/2013 7.7 18 partial T7 5 grass leaflitter
8 2013 spring Mar 12 79 20/03/2013 7.7 18 partial T8 5 grass leaflitter
9 2013 spring Mar 12 79 20/03/2013 7.5 18 partial T9 10 grass bramble
10 2013 spring Mar 12 79 20/03/2013 7.5 18 partial T10 5 grass leaflitter
11 2013 spring Mar 13 86 27/03/2013 5.8 5 partial T1 22 grass leaflitter
m3TMB<-glmmTMB(nymphs ~ tsurvey + precip + vegheight + cloudcover +veg1 +
(1|transect) + (1|date), family = nbinom2, data = mydata)
> summary(m3TMB)
Family: nbinom2 ( log )
Formula: nymphs ~ tsurvey + precip + vegheight + cloudcover + veg1 + (1|transect) + (1|date)
Data: mydata
AIC BIC logLik deviance df.resid
3958 4052 -1962 3924 1771
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
date (Intercept) 1.841 1.357
transect (Intercept) 0.128 0.358
Number of obs: 1788, groups: date, 181; transect, 10
Overdispersion parameter for nbinom2 family (): 1.2
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.09e+00 1.23e+00 -0.89 0.375
tsurvey 7.16e-02 1.65e-02 4.35 1.4e-05 ***
precip -1.86e-02 7.37e-03 -2.53 0.011 *
vegheight -3.57e-02 5.17e-03 -6.91 4.9e-12 ***
cloudcovercloudy 3.43e-01 2.78e-01 1.23 0.219
cloudcoverpartial 4.64e-01 3.01e-01 1.54 0.123
cloudcoverunknown 3.00e-01 6.49e-01 0.46 0.643
veg1bracken -7.33e-01 1.38e+00 -0.53 0.596
veg1bramble -2.31e-01 1.18e+00 -0.20 0.845
veg1grass -1.54e-01 1.18e+00 -0.13 0.896
veg1leaf -1.76e+00 1.26e+00 -1.40 0.162
veg1leaflitter -9.64e-01 1.19e+00 -0.81 0.416
veg1Leaft litter -1.35e+01 2.16e+03 -0.01 0.995
veg1nettle -1.29e-01 1.24e+00 -0.10 0.917
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Some issues with the data: it is over-dispersed; the response variable has a mean of 1 and variance of 7. There is a strong seasonal element, I am not really sure how /if this needs to be dealt with in a specific manor? There are a lot of zeros in the response variable but I think these are genuine zero counts and so not zero inflated (I hope I am understanding this correctly).
Initially I looked at the final count for each weekly survey i.e. analysis of the total count across all ten transects, but I realised that I was losing a lot of information by doing this ( e.g. the temperature, vegetation and insect count varied in each transect). I decided it would be best to use the data from each transect (T1 to T10). I investigated various methods – glmm seemed best, since there was temporal and spatial pseudoreplication. I did also look into GAMMs, which I thought were perhaps the way forward but found various tutorials/posts that advised extreme caution in their interpretation, especially of their p-values and CI’s. So I turned my attention back to GLMMS – I tried various packages, focusing in on glmmADMB, glmmmgcv and finally; glmmTMB. The latter seemed like it would handle the random effects of date and transect well.
I am not a statistician (surprise) and still learning a lot when it comes to R, stats and modelling – I am aware of the pontential misinterpretation of model outputs (as described in Bolker et al. 2009) and although I understand the limitation of models, I would like to avoid the pit falls of misinterpretation (as much as that is possible!).
Having come this far, my main issues are the interpretation of the output and the validation of the model. Where do I go from here with the validation of this model…if it does indeed make sense for my data? I’ve found some worked examples online but none seem to go through all of the steps to the actual reporting stage – this would be extremely helpful to those learning.
I tried using the DARMHa package for residual diagnostics but it doesn’t seem to support this class of model.
I plotted the residuals (see below) but I am unsure if this the correct way to interpret residuals from a glmmTMB :
m_3sresid <- resid(m3TMB, "pearson")
plot(m_3sresid)
I feel like I've been going in circles for a long time and would appreciate any suggestions from those more experienced or pointed in the direction of some complete worked examples, specially relating to glmmTMB model validation. I’ve gotten a lot of tips on data analysis from Cross Validated in the past but this is my first time asking a question – please let me know if further clarification is required.
Best Answer
As far as I know, there is no way around DHARMa (disclaimer: I maintain the package) or a similar simulation-based approach to achieve a general and robust residual check for this type of GLMMs. See discussion in the DHARMa vignette why Pearson residuals are not a good idea.
So far, I had not included glmmTMB in DHARMa because of the lack of a simulate function, but I just checked and realized that they have implemented such a function, so I should be able to add glmmTMB support in DHARMa in the next days. See updates about this and an interim solution here.
Edit 30/03/18: a test version is now available, see install instructions here. There are still a few smaller compatibility issues, but I would say it's usable for research, with due caution - I will mix these changes into the CRAN version of DHARMa as soon as all issues are fixed. Until this is mixed into CRAN, install the test version as described in the issue linked above.