Solved – GLMM with Poisson distribution for non-integer “count” data

Tags: count-data, glmm, glmmtmb, poisson-distribution

Excuse me if this is a duplicate. I've searched this and other sites and found some good information about GLMMs and Poisson distributions, but my case seems a bit different.

I am currently analyzing soil nematode community data. My dependent variables are the total number of nematodes (counted in a petri dish under a dissecting scope) per gram of dry soil, and the number in each functional group (bacterivore, fungivore, plant parasite, omnivore, predator) per gram of dry soil. The total number of nematodes is a whole number; however, the way we typically estimate the abundance of each functional group results in non-integer data. So for total nematodes I can use soil weight as the offset, but it doesn't seem I can do that with the functional groups, because the "count" data I generated are still not integers. A further issue is that the data for my higher trophic levels (omnivores and predators) are full of zeros.

My data looks like this. I'll just include total nematodes and estimated predators for ease.

#nem  soil_g  nem/gsoil  #ffg  #Pr  prop_pr  pred/gsoil  "pr_count"  vegtype
  52   25.60      2.031    37    0    0.000       0.000       0.000  maple
   9   27.73      0.325     7    0    0.000       0.000       0.000  maple
   2   26.91      0.074     2    1    0.500       0.037       1.000  maple
  50   21.55      2.320    27    0    0.000       0.000       0.000  maple
  38   18.55      2.049    23    0    0.000       0.000       0.000  maple
  87    5.71     15.236    50   11    0.220       3.352      19.140  alder
 110   13.87      7.931   101    2    0.020       0.157       2.200  alder
 174   19.10      9.110   116    7    0.060       0.550      10.440  alder
  54   24.97      2.163    54    1    0.018       0.039       0.972  alder

Here, #nem is the total nematode abundance, soil_g is the dry weight of the soil from which those nematodes were extracted, nem/gsoil is the number of nematodes per gram of dry soil, #ffg is the total number of nematodes identified to functional feeding group, #Pr is the number of nematodes identified as predators, prop_pr is #Pr/#ffg, pred/gsoil is prop_pr * nem/gsoil, pr_count is prop_pr * #nem, and vegtype is the tree from which the soil sample came.
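The derived columns can be recomputed from the raw counts, which also makes the source of the non-integer values explicit. Below is a minimal base R sketch using the four alder rows from the table; the data frame and column names (`nem_example`, `n_nem`, `n_ffg`, `n_pr`, and so on) are illustrative stand-ins for the table headers, not the names used in the actual data set.

```r
# Alder rows from the table above (raw counts and dry soil mass only).
# All object and column names here are hypothetical, chosen for illustration.
nem_example <- data.frame(
  n_nem  = c(87, 110, 174, 54),          # "#nem": total nematodes counted
  soil_g = c(5.71, 13.87, 19.10, 24.97), # dry soil mass in grams
  n_ffg  = c(50, 101, 116, 54),          # "#ffg": identified to feeding group
  n_pr   = c(11, 2, 7, 1)                # "#Pr": identified as predators
)

# Derived quantities, following the definitions given in the text
nem_example$nem_per_g  <- nem_example$n_nem / nem_example$soil_g
nem_example$prop_pr    <- nem_example$n_pr / nem_example$n_ffg
nem_example$pred_per_g <- nem_example$prop_pr * nem_example$nem_per_g
nem_example$pr_count   <- nem_example$prop_pr * nem_example$n_nem

print(round(nem_example, 3))
```

Small differences from the table (for example, a pr_count of 1 rather than 0.972 in the last row) are consistent with the tabulated value having been computed from prop_pr rounded to three decimals.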

As you can see, my predator "counts" are estimates and are therefore non-integer. It isn't shown here, but many values lie between 0 and 1, so I don't want to simply round the data.

I want to use a Poisson distribution and include a random effect (stand). I used the following model in R:

library(glmmTMB)
tmb1 <- glmmTMB(pr_count ~ vegtype + offset(log(Est_drysoil_g)) + (1 | stand), data = nem, ziformula = ~1, family = poisson)

I tried this with glmer from the lme4 package, but it wouldn't work, presumably because of the non-integer dependent variable. With glmmTMB, however, it seems to work. I mainly want to make sure I'm not doing anything too crazy here. Out of 225 soil samples (5 from each of 45 stands), 121 samples have 0 predators. Eventually, I will add other continuous variables to this model, such as soil moisture and soil texture (% sand).

Here is the output of the above model:

    Family: poisson  ( log )
Formula:          pr ~ vegtype + offset(log(Est_drysoil_g)) + (1 | stand)
Zero inflation:      ~1
Data: nem

    AIC      BIC   logLik deviance df.resid 
   848.2    861.8   -420.1    840.2      221 

Random effects:

Conditional model:
 Groups Name        Variance Std.Dev.
 stand  (Intercept) 0.9056   0.9516  
 Number of obs: 225, groups:  stand, 45

Conditional model:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -1.6571     0.2903  -5.709 1.14e-08 ***
vegtypeMaple  -0.8612     0.3515  -2.450   0.0143 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Zero-inflation model:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.5508     0.2275  -2.421   0.0155 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
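
As a reading aid for the output above, the estimates can be back-transformed: the conditional model uses a log link (as printed in the Family line) and the glmmTMB zero-inflation model uses a logit link. The sketch below is only arithmetic on the reported coefficients, not a refit of the model, and the variable names are illustrative.

```r
# Reported estimates copied from the summary above
b_intercept  <- -1.6571  # conditional model intercept (reference vegetation type)
b_maple      <- -0.8612  # conditional model, vegtypeMaple
zi_intercept <- -0.5508  # zero-inflation intercept (logit scale)

# Expected predators per gram of dry soil for the reference level,
# in the conditional (Poisson) part, at a stand random effect of zero
rate_ref <- exp(b_intercept)

# Multiplicative change in that rate for maple relative to the reference level
rate_ratio <- exp(b_maple)

# Estimated probability of a structural (excess) zero
p_structural_zero <- plogis(zi_intercept)

print(c(rate_ref = rate_ref,
        rate_ratio = rate_ratio,
        p_structural_zero = p_structural_zero))
```

These back-transformations describe what the fitted model implies; they do not address whether a Poisson response is suitable for non-integer data.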

Here is a histogram of my predator data:

[![predator histogram][1]][1]

[1]: https://i.stack.imgur.com/7mu7P.jpg

My questions are:

  1. Am I doing this right, or have I committed some fatal statistical crime?
  2. Am I right in choosing the Poisson family?
  3. Any suggestions would be greatly appreciated!

Thanks all,

Wendal Kane

Best Answer

To answer your questions, Wendal:

1) You are not doing this right: your outcome data are semi-continuous (i.e., a combination of a point mass at zero and a positively skewed continuous distribution), but you are analyzing them as if they were count data. So you have committed a fatal statistical crime. 😜

2) The Poisson distribution is appropriate for modeling count data, but not for modeling semi-continuous data. More appropriate distributions for the positive part of data such as yours would be the log-normal distribution or the Gamma distribution.

3) You could model your semi-continuous data using either zero-inflated models or hurdle models in conjunction with one of the two suggested distributions (i.e., log-normal, Gamma). Sean Anderson gives a nice description of when you might wish to consider each type of model: http://seananderson.ca/2014/05/18/gamma-hurdle/. See also the thread "What is the difference between zero-inflated and hurdle models?" for concrete examples of when each type of model would make sense.
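
As a rough illustration of point 3, here is a minimal sketch of a Gamma model with a separate zero component, fitted with glmmTMB's `ziGamma` family. It runs on simulated data, not the nematode data: the data frame `sim`, the response `pred_per_g`, and all simulation settings are assumptions made up for this example, and only the 45 stands with 5 samples each mirror the design described in the question.

```r
library(glmmTMB)

set.seed(1)

# Simulated stand-in for the real data: 45 stands, 5 samples per stand
n_stand <- 45
sim <- data.frame(stand = factor(rep(seq_len(n_stand), each = 5)))
stand_id <- as.integer(sim$stand)

# Vegetation type is assigned per stand (alternating, purely for illustration)
stand_veg <- rep(c("alder", "maple"), length.out = n_stand)
sim$vegtype <- factor(stand_veg[stand_id])

# Positive part: Gamma with a log-scale mean and a stand random intercept
stand_eff <- rnorm(n_stand, sd = 0.5)
mu <- exp(-1 - 0.8 * (sim$vegtype == "maple") + stand_eff[stand_id])

# About half of the samples are set to exactly zero
is_positive <- rbinom(nrow(sim), size = 1, prob = 0.5)
sim$pred_per_g <- is_positive * rgamma(nrow(sim), shape = 2, rate = 2 / mu)

# Zero component (logit link) plus Gamma model (log link) for positive values
fit <- glmmTMB(pred_per_g ~ vegtype + (1 | stand),
               ziformula = ~ 1,
               family = ziGamma(link = "log"),
               data = sim)

summary(fit)
```

Because the Gamma distribution places no mass at zero, the zero component here accounts for every zero, so this parameterization behaves like a hurdle model. If the chance of a zero is expected to differ between vegetation types, `ziformula = ~ vegtype` is one option to consider.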
