Solved – GLMM with poisson distribution for non-integer “count” data

count-dataglmmglmmtmbpoisson distribution

Excuse me if this is duplicated. I've poked around this and other sites and have found some good info about glmm's and Poisson distributions, however my case seems a bit different.

I am currently analyzing soil nematode community data. My dependent variables are total number of nematodes (counted in petri dish with dissecting scope) per gram of dry soil, and the number of each functional group (bacterivore, fungivore, plant parasite, omnivore, predator) per gram of dry soil. Total number of nematodes is a whole number, however, the way we typically estimate abundance of each functional groups results in non-integer data. So for total nematodes, I can use soil weight as the offset, however it doesn't seem I can do that with the functional groups because the "count" data I generated is still not an integer. A further issue is my higher trophic level, omnivores and predators, data is full of zeros.

My data looks like this. I'll just include total nematodes and estimated predators for ease.

#nem soil_g nem/gsoil   #ffg    #Pr prop_pr pred/gsoil  "pr_count"  vegtype

52  25.60   2.031        37      0   0.000    0.000       0.000     maple
9   27.73   0.325         7      0   0.000    0.000       0.000     maple
2   26.91   0.074         2      1   0.500    0.037       1.000     maple
50  21.55   2.320        27      0   0.000    0.000       0.000     maple
38  18.55   2.049        23      0   0.000    0.000       0.000     maple
87  5.71    15.236       50     11   0.220    3.352      19.140     alder
110 13.87   7.931       101      2   0.020    0.157       2.200     alder
174 19.10   9.110       116      7   0.060    0.550      10.440     alder
54  24.97   2.163        54      1   0.018    0.039       0.972     alder

Here, #nem is the total nematode abundance, soil_g is the dry weight of soil that those nematodes were extracted from, nem/gsoil is the number of nematodes per gram of dry soil, #ffg is the total number of nematodes identified to functional feeding group, #Pr is the number of nematodes identified as predator, prop_pr is #Pr/#ffg, pred/gsoil is prop_pr * nem/gsoil, pr_count is prop_pr * #nem, and vegtype is the tree from which the soil sample came from.

As you can see, my predator "counts" are estimates and produce non-integer data. This isn't shown here, but many are between 1 and 0, so I don't want to simply round the data.

I want to use a poisson distribution and include random effects (stand). I used the following model in r:

tmb1<- glmmTMB(pr_count~vegtype + offset(log(Est_drysoil_g))+(1|stand), data=nem, ziformula=~1, family=poisson)

I tried this with glmer from the lme4 package, but it wouldn't work. I assume this is because the non-integer dependent variable. However, when I used glmmTMB it seems to work. I mainly just want to make sure I'm not doing anything too crazy here. Out of 225 soil samples (5 from each of 45 stands), there are 121 samples with 0 predators. Eventually, I will include other continuous data into this model such as soil moisture and soil texture(%sand).

Here is the output of the above model:

    Family: poisson  ( log )
Formula:          pr ~ vegtype + offset(log(Est_drysoil_g)) + (1 | stand)
Zero inflation:      ~1
Data: nem

    AIC      BIC   logLik deviance df.resid 
   848.2    861.8   -420.1    840.2      221 

Random effects:

Conditional model:
 Groups Name        Variance Std.Dev.
 stand  (Intercept) 0.9056   0.9516  
 Number of obs: 225, groups:  stand, 45

Conditional model:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -1.6571     0.2903  -5.709 1.14e-08 ***
vegtypeMaple  -0.8612     0.3515  -2.450   0.0143 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Zero-inflation model:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.5508     0.2275  -2.421   0.0155 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Here is a histogram of my predator data[![predator histogram][1]][1]
[1]: https://i.stack.imgur.com/7mu7P.jpg

My question is:

Am I doing this right? or have I committed some fatal statistical crime?
Am I right in choosing the poisson family?
Any suggestions would be greatly appreciated!

Thanks all,

Wendal Kane

Best Answer

To answer your questions, Wendal:

1) You are not doing this right - your outcome data are semi-continuous (i.e., a combination of a point-mass at zero and a positive skewed continuous distribution), but you are analyzing the data as if they are count data. So you have committed a fatal statistical crime. 😜

2) The Poisson distribution is appropriate for modeling count data, but not for modeling semi-continuous data. More appropriate distributions for data such as yours would be the log-normal distribution or the Gamma distribution.

3) You could model your semi-continuous data using either zero-inflated models or hurdle models in conjunction with one of the two suggested distributions (i.e., log-normal, Gamma). Sean Anderson gives a nice description on when you might wish to consider each type of model: http://seananderson.ca/2014/05/18/gamma-hurdle/. See also this thread for concrete examples of when each type of model would make sense: What is the difference between zero-inflated and hurdle models?

Related Solutions

Solved – Should I use Poisson distribution for non-integer, count-like data

1) Is it correct to use a GLMM with Poisson distribution with such data? (I don't think so but glmer seems to work anyway)

No, it is not correct. By "count data" we generally mean data that records number of cases, so it can be only non-negative and integer-valued. The same is with Poisson distribution, that is a distribution for non-negative integer-valued data. Under Poisson distribution probability of observing non-integer value is zero and R behaves accordingly to it:

dpois(c(1, 1.5, 2, 2.5, 3), 5)
## [1] 0.03368973 0.00000000 0.08422434 0.00000000 0.14037390
## Warning messages:
## 1: In dpois(c(1, 1.5, 2, 2.5, 3), 5) : non-integer x = 1.500000
## 2: In dpois(c(1, 1.5, 2, 2.5, 3), 5) : non-integer x = 2.500000

You can estimate log-linear glmm using this data but assuming Poisson distribution means that you treat all the non-integers as improbable values so R throws appropriate warnings. This means that the estimates of log-likelihood and the ones based on it, like AIC, won't be what you want them to be.

This doesn't mean that you cannot estimate log-linear regression with non-integer data. You can, but you can't assume Poisson distribution for such data.

See also What regression model is the most appropriate to use with count data? thread (check also the discussion in comments below the answer) and How does a Poisson distribution work when modeling continuous data and does it result in information loss? .

Solved – Help interpreting count data GLMM using lme4 glmer and glmer.nb – Negative binomial versus Poisson

I believe there are some important problems to be addressed with your estimation.

From what I gathered by examining your data, your units are not geographically grouped, i.e. census tracts within counties. Thus, using tracts as a grouping factor is not appropriate to capture spatial heterogeneity as this means that you have the same number of individuals as groups (or to put another way, all your groups have only one observation each). Using a multilevel modelling strategy allows us to estimate individual-level variance, while controlling for between-group variance. Since your groups have only one individual each, your between-group variance is the same as your individual-level variance, thus defeating the purpose of the multilevel approach.

On the other hand, the grouping factor can represent repeated measurements over time. For example, in the case of a longitudinal study, an individual's "maths" scores may be recored yearly, thus we would have a yearly value for each student for n years (in this case, the grouping factor is the student as we have n number of observations "nested" within students). In your case, you have repeated measures of each census tract by decade. Thus, you could use your TRTID10 variable as a grouping factor to capture "between decade variance". This leads to 3142 observations nested in 635 tracts, with approximately 4 and 5 observations per census tract.

As mentioned in a comment, using decade as a grouping factor is not very appropriate, as you have only around 5 decades for each census tract, and their effect can be better captured introducing decade as a covariate.

Second, to determine whether your data ought to be modelled using a poisson or negative binomial model (or a zero inflated approach). Consider the amount of overdispersion in your data. The fundamental characteristic of a Poisson distribution is equidispersion, meaning that the mean is equal to the variance of the distribution. Looking at your data, it is pretty clear that there is much overdispersion. Variances are much much greater than the means.

library(dplyr)    
 dispersionstats <- scaled.mydata %>%
 + group_by(decade) %>%
 + summarise(
 + means = mean(R_VAC),
 + variances = var(R_VAC),
 + ratio = variances/means)

##   dispersionstats
##   # A tibble: 5 x 5
##   decade     means variances     ratio 
##    <int>     <dbl>     <dbl>     <dbl> 
## 1   1970  45.43513   4110.89  90.47822 
## 2   1980 103.52365  17323.34 167.33707 
## 3   1990 177.68038  62129.65 349.67087 
## 4   2000 190.23150  91059.60 478.67784 
## 5   2010 247.68246 126265.60 509.78821

Nonetheless, to determine if the negative binomial is more appropriate statistically, a standard method is to do a likelihood ratio test between a Poisson and a negative binomial model, which suggests that the negbin is a better fit.

library(MASS)
library(lmtest)

modelformula <- formula(R_VAC ~ factor(decade) + P_NONWHT * a_hinc + offset(HU_ln))

poismodel <- glm(modelformula, data = scaled.mydata, family = "poisson")   
nbmodel <- glm.nb(modelformula, data = scaled.mydata)

lrtest(poismodel, nbmodel)

## Likelihood ratio test

##  Model 1: R_VAC ~ factor(decade) + P_NONWHT * a_hinc + offset(HU_ln)  
## Model 2: R_VAC ~ factor(decade) + P_NONWHT * a_hinc + offset(HU_ln)
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   8 -154269
## 2   9  -17452  1 273634  < 2.2e-16 ***
##  ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

After establishing this, a next test could consider whether the multilevel (mixed model) approach is warranted using similar approach, which suggests that the multilevel version provides a better fit. (A similar test could be used to compare a glmer fit assuming a poisson distribution to the glmer.nb object, as long as the models are otherwise the same.)

library(lme4)

glmmformula <- update(modelformula, . ~ . + (1|TRTID10))

nbglmm <- glmer.nb(glmmformula, data = scaled.mydata)

lrtest(nbmodel, nbglmm)

## Model 1: R_VAC ~ factor(decade) + P_NONWHT * a_hinc + offset(HU_ln)
## Model 2: R_VAC ~ factor(decade) + P_NONWHT + a_hinc + (1 | TRTID10) +
##     P_NONWHT:a_hinc + offset(HU_ln)
##   #Df LogLik Df Chisq Pr(>Chisq)
## 1   9 -17452
## 2  10 -17332  1 239.3  < 2.2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Regarding the estimates of the poisson and nb models they are actually expected to be very similar to each other, with the main distinction being the standard errors, ie if overdispersion is present, the poisson model tends to provide biased standard errors. Taking your data as an example:

poissonglmm <- glmer(glmmformula, data = scaled.mydata)
summary(poissonglmm)

## Random effects:
##  Groups  Name        Variance Std.Dev.
## TRTID10 (Intercept) 0.2001   0.4473
## Number of obs: 3142, groups:  TRTID10, 635

## Fixed effects:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        -2.876013   0.020602 -139.60   <2e-16 ***
## factor(decade)1980  0.092597   0.007602   12.18   <2e-16 ***
## factor(decade)1990  0.903543   0.007045  128.26   <2e-16 ***
## factor(decade)2000  0.854821   0.006913  123.65   <2e-16 ***
## factor(decade)2010  0.986126   0.006723  146.67   <2e-16 ***
## P_NONWHT           -0.125500   0.014007   -8.96   <2e-16 ***
## a_hinc             -0.107335   0.001480  -72.52   <2e-16 ***
## P_NONWHT:a_hinc     0.160937   0.003117   51.64   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

summary(nbglmm)
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  TRTID10 (Intercept) 0.09073  0.3012
## Number of obs: 3142, groups:  TRTID10, 635

## Fixed effects:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        -2.797861   0.056214  -49.77  < 2e-16 ***
## factor(decade)1980  0.118588   0.039589    3.00  0.00274 **
## factor(decade)1990  0.903440   0.038255   23.62  < 2e-16 ***
## factor(decade)2000  0.843949   0.038172   22.11  < 2e-16 ***
## factor(decade)2010  1.068025   0.037376   28.58  < 2e-16 ***
## P_NONWHT            0.020012   0.089224    0.22  0.82253
## a_hinc             -0.129094   0.008109  -15.92  < 2e-16 ***
## P_NONWHT:a_hinc     0.149223   0.018967    7.87 3.61e-15 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Notice how the coefficient estimates are all very similar, the main difference being only the significance of one of your covariates, as well as the difference in the random effects variance, which suggests that the unit-level variance captured by the overdispersion parameter in the nb model (the theta value in the glmer.nb object) captures some of the between tract variance captured by the random effects.

Regarding exponentiated coefficients (and associated confidence intervals), you can use the following:

fixed <- fixef(nbglmm)
confnitfixed <- confint(nbglmm, parm = "beta_", method = "Wald") # Beware: The Wald method is less accurate but much, much faster.

# The exponentiated coefficients are also known as Incidence Rate Ratios (IRR)
IRR <- exp(cbind(fixed, confintfixed)
IRR
##                         fixed      2.5 %     97.5 %
## (Intercept)        0.06094028 0.05458271 0.06803835
## factor(decade)1980 1.12590641 1.04184825 1.21674652
## factor(decade)1990 2.46807856 2.28979339 2.66024515
## factor(decade)2000 2.32553168 2.15789585 2.50619029
## factor(decade)2010 2.90962703 2.70410073 3.13077444
## P_NONWHT           1.02021383 0.85653208 1.21517487
## a_hinc             0.87889172 0.86503341 0.89297205
## P_NONWHT:a_hinc    1.16093170 1.11856742 1.20490048

Final thoughts, regarding zero inflation. There is no multilevel implementation (at least that I am aware of) of a zero inflated poisson or negbin model that allows you to specify an equation for zero inflated component of the mixture. the glmmADMB model lets you estimate a constant zero inflation parameter. An alternative approach would be to use the zeroinfl function in the pscl package, though this does not support multilevel models. Thus, you could compare the fit of a single level negative binomial, to the single level zero inflated negative binomial. Chances are that if zero inflation is not significant for single level models, it is likely that it would not be significant for the multilevel specification.

Addendum

If you are concerned about spatial autocorrelation, you could control for this using some form of geographical weighted regression (though I believe this uses point data, not areas). Alternatively, you could group your census tracts by an additional grouping factor (states, counties) and include this as a random effect. Lastly, and I am not sure if this is entirely feasible, it may be possible to incorporate spatial dependence using, for example, the average count of R_VAC in first order neighbours as a covariate. In any case, prior to such approaches, it would be sensible to determine if spatial autocorrelation is indeed present (using Global Moran's I, LISA tests, and similar approaches).

Best Answer

Related Solutions

Solved – Should I use Poisson distribution for non-integer, count-like data

Solved – Help interpreting count data GLMM using lme4 glmer and glmer.nb – Negative binomial versus Poisson

Addendum

Related Question