Solved – Zero-inflation on steroids: choose among Poisson, negative binomial and zero-inflated regressions

negative-binomial-distributionpoisson distributionrregressionzero inflation

I am struggling to fit alternative count models into my data. I guess my problem is just too many zeros.

This is my data

> summary(smpl)
    response        predict1          predict2        
 Min.   :0.000   Min.   :   1.00   Min.   :    22005  
 1st Qu.:0.000   1st Qu.:   3.00   1st Qu.:  4669705  
 Median :0.000   Median :   8.00   Median : 12540318  
 Mean   :0.017   Mean   :  23.27   Mean   : 20382574  
 3rd Qu.:0.000   3rd Qu.:  20.00   3rd Qu.: 25468156  
 Max.   :3.000   Max.   :1584.00   Max.   :145348049

> table(smpl$response)
  0   1   2   3 
987  10   2   1

I tried three regressions: basic Poisson, negative binomial and zero-inflated but the only formula returning coefficients without warnings is the Poisson:

> summary(glm(response ~ ., data = smpl, family = poisson))

Call:
glm(formula = response ~ ., family = poisson, data = smpl)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3871  -0.2214  -0.1722  -0.1148   4.7861  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.472e+00  3.521e-01  -9.862  < 2e-16 ***
predict1     3.229e-03  7.271e-04   4.442 8.93e-06 ***
predict2    -6.258e-08  3.060e-08  -2.045   0.0409 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 150.67  on 999  degrees of freedom
Residual deviance: 135.84  on 997  degrees of freedom
AIC: 170.06

Number of Fisher Scoring iterations: 8

The negative binomial returns a warnings on both the convergence and the alternation limit

summary(glm.nb(response ~ ., data = smpl))

Call:
glm.nb(formula = response ~ ., data = smpl, init.theta = 0.04901296596, 
    link = log)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.28844  -0.17677  -0.14542  -0.09808   2.38314  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.899e+00  4.587e-01  -8.499  < 2e-16 ***
predict1     1.226e-02  2.144e-03   5.720 1.06e-08 ***
predict2    -5.982e-08  3.407e-08  -1.756   0.0791 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.049) family taken to be 1)

    Null deviance: 69.927  on 999  degrees of freedom
Residual deviance: 55.940  on 997  degrees of freedom
AIC: 152.37

Number of Fisher Scoring iterations: 1


              Theta:  0.0490 
          Std. Err.:  0.0251 
Warning while fitting theta: alternation limit reached 

 2 x log-likelihood:  -144.3700 
Warning messages:
1: glm.fit: algorithm did not converge 
2: In glm.nb(response ~ ., data = smpl) : alternation limit reached

and the zero-inflated (from the pscl package) doesn't return anything at all

> summary(zeroinfl(response ~ ., data = smpl, dist = "negbin"))

Call:
zeroinfl(formula = response ~ ., data = smpl, dist = "negbin")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-0.45252 -0.08817 -0.05515 -0.04210 19.56118 

Count model coefficients (negbin with log link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.477e+00         NA      NA       NA
predict1     2.678e-03         NA      NA       NA
predict2    -1.160e-07         NA      NA       NA
Log(theta)  -1.241e+00         NA      NA       NA

Zero-inflation model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.869e+00         NA      NA       NA
predict1    -1.329e-01         NA      NA       NA
predict2    -1.346e-07         NA      NA       NA
Error in if (getOption("show.signif.stars") & any(rbind(x$coefficients$count,  : 
  missing value where TRUE/FALSE needed

Then my questions are:

Is there anything I can do in terms of "formula tweaking" with the negative binomial (to avoid the warnings) and with the zero-inflated (to get the coefficients)?
Looking only at the results above (thus including problems with convergence and alternation limit) should I select the negative binomial model since it seems, looking at the AIC, to fit better than the Poisson in my data?

Best Answer

You have 13 observations that are non-zero and only 3 observations that are greater than 1. Hence, I believe the best thing you can do is to use a binary regression only (0 vs > 0). Possibly it could help to add some form of regularization for that as well, e.g. bias reduction (see package brglm), because non-zeros are so rare.

Furthermore, I would recommend to bring the predict variables to a similar scale, e.g., dividing predict2 by 1e6 or so. Otherwise, numerical optimization algorithms have problems. This might improve convergence for glm.nb and zeroinfl (which by the way does not return nothing - there are coefficients but the covariance matrix estimate does not seem valid).

Related Solutions

Solved – Understanding over-dispersion as it relates to the Poisson and the Neg. Binomial

This answer comes in two parts. The first addresses the issue of the standard errors and why that implies the models are not identical, as @ndoogan observed in comments, and the second addresses, partially, the issue of when the coefficient estimates might be substantially different.

Consider, for example, a hypothesis test on the coefficient of log(Vmaj) where the null is that the coefficient equals 0.5. The two sets of model estimates would result in rejection of the null in the Poisson case, unless testing at a very high confidence level, and failure to reject the null in the NB case.

More generally, there is more to a collection of estimates than just the point estimates. In the presence of (Negative binomial-like) overdispersion, the standard errors produced by the Negative Binomial-based model will be more accurate estimates of the underlying standard deviation of the coefficient estimates. Thus, the NB model as a whole is more accurate.

For example, a simple model with $\mu_i = \exp\{1 + x_i\}$, and $y_i \sim NB$ with mean $\mu_i$ and variance $5\mu_i$:

N <- 1000
mu <- exp(1 + (x <- rnorm(N)))

p <- 0.2
r <- mu * p / (1-p)
y <- rnbinom(N, r, p)

poisson_summary = summary(glm(y~x, family="poisson"))
nb_summary = summary(glm.nb(y~x))

# Parametric bootstrap calculation of the s.e. of the coefficient of x

coef_x <- rep(0,1000)
for (i in 1:length(coef_x)) {
   mu <- exp(1 + (x <- rnorm(N)))
   r <- mu * p / (1-p)
   y <- rnbinom(N, r, p)
   coef_x[i] <- summary(glm(y~x, family="poisson"))$coefficients["x","Estimate"]
}
data.frame("Sim. SE" = sd(coef_x - 1), 
           "Poisson SE" = poisson_summary$coefficients["x", "Std. Error"],
       "N.B. SE" = nb_summary$coefficients["x", "Std. Error"])

     Sim..SE Poisson.SE    N.B..SE
1 0.03492467 0.01273354 0.03984743

where you can see that the simulated std. deviation of the coefficient estimate is roughly 3x the Poisson-based estimate of same, and much better estimated by the NB-based estimate of same.

The coefficient estimates are pretty much the same, as one might expect with this sample size and the std. errors above, so I won't take up space by displaying them.

Additionally, although this is often a minor effect when overdispersion is low, by weighting the observations with more accurate weights (i.e., better estimates of observation-specific variances), the parameter estimates themselves will be more accurate, well, asymptotically at any rate. The rule of thumb I learned long ago was that heteroskedasticity adjustments (for that is what they are, in essence) don't buy you much unless the differences between weights are on the order of 5x or more.

Note, however, that in small samples you may well get more accurate (in MSE terms) point estimates with the Poisson model if there isn't much overdispersion, because you are reducing the variability induced by estimating the dispersion parameter. Of course, this is almost certainly more than offset by the loss of accuracy in the standard errors of the coefficient estimates.

Solved – Help interpreting count data GLMM using lme4 glmer and glmer.nb – Negative binomial versus Poisson

I believe there are some important problems to be addressed with your estimation.

From what I gathered by examining your data, your units are not geographically grouped, i.e. census tracts within counties. Thus, using tracts as a grouping factor is not appropriate to capture spatial heterogeneity as this means that you have the same number of individuals as groups (or to put another way, all your groups have only one observation each). Using a multilevel modelling strategy allows us to estimate individual-level variance, while controlling for between-group variance. Since your groups have only one individual each, your between-group variance is the same as your individual-level variance, thus defeating the purpose of the multilevel approach.

On the other hand, the grouping factor can represent repeated measurements over time. For example, in the case of a longitudinal study, an individual's "maths" scores may be recored yearly, thus we would have a yearly value for each student for n years (in this case, the grouping factor is the student as we have n number of observations "nested" within students). In your case, you have repeated measures of each census tract by decade. Thus, you could use your TRTID10 variable as a grouping factor to capture "between decade variance". This leads to 3142 observations nested in 635 tracts, with approximately 4 and 5 observations per census tract.

As mentioned in a comment, using decade as a grouping factor is not very appropriate, as you have only around 5 decades for each census tract, and their effect can be better captured introducing decade as a covariate.

Second, to determine whether your data ought to be modelled using a poisson or negative binomial model (or a zero inflated approach). Consider the amount of overdispersion in your data. The fundamental characteristic of a Poisson distribution is equidispersion, meaning that the mean is equal to the variance of the distribution. Looking at your data, it is pretty clear that there is much overdispersion. Variances are much much greater than the means.

library(dplyr)    
 dispersionstats <- scaled.mydata %>%
 + group_by(decade) %>%
 + summarise(
 + means = mean(R_VAC),
 + variances = var(R_VAC),
 + ratio = variances/means)

##   dispersionstats
##   # A tibble: 5 x 5
##   decade     means variances     ratio 
##    <int>     <dbl>     <dbl>     <dbl> 
## 1   1970  45.43513   4110.89  90.47822 
## 2   1980 103.52365  17323.34 167.33707 
## 3   1990 177.68038  62129.65 349.67087 
## 4   2000 190.23150  91059.60 478.67784 
## 5   2010 247.68246 126265.60 509.78821

Nonetheless, to determine if the negative binomial is more appropriate statistically, a standard method is to do a likelihood ratio test between a Poisson and a negative binomial model, which suggests that the negbin is a better fit.

library(MASS)
library(lmtest)

modelformula <- formula(R_VAC ~ factor(decade) + P_NONWHT * a_hinc + offset(HU_ln))

poismodel <- glm(modelformula, data = scaled.mydata, family = "poisson")   
nbmodel <- glm.nb(modelformula, data = scaled.mydata)

lrtest(poismodel, nbmodel)

## Likelihood ratio test

##  Model 1: R_VAC ~ factor(decade) + P_NONWHT * a_hinc + offset(HU_ln)  
## Model 2: R_VAC ~ factor(decade) + P_NONWHT * a_hinc + offset(HU_ln)
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   8 -154269
## 2   9  -17452  1 273634  < 2.2e-16 ***
##  ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

After establishing this, a next test could consider whether the multilevel (mixed model) approach is warranted using similar approach, which suggests that the multilevel version provides a better fit. (A similar test could be used to compare a glmer fit assuming a poisson distribution to the glmer.nb object, as long as the models are otherwise the same.)

library(lme4)

glmmformula <- update(modelformula, . ~ . + (1|TRTID10))

nbglmm <- glmer.nb(glmmformula, data = scaled.mydata)

lrtest(nbmodel, nbglmm)

## Model 1: R_VAC ~ factor(decade) + P_NONWHT * a_hinc + offset(HU_ln)
## Model 2: R_VAC ~ factor(decade) + P_NONWHT + a_hinc + (1 | TRTID10) +
##     P_NONWHT:a_hinc + offset(HU_ln)
##   #Df LogLik Df Chisq Pr(>Chisq)
## 1   9 -17452
## 2  10 -17332  1 239.3  < 2.2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Regarding the estimates of the poisson and nb models they are actually expected to be very similar to each other, with the main distinction being the standard errors, ie if overdispersion is present, the poisson model tends to provide biased standard errors. Taking your data as an example:

poissonglmm <- glmer(glmmformula, data = scaled.mydata)
summary(poissonglmm)

## Random effects:
##  Groups  Name        Variance Std.Dev.
## TRTID10 (Intercept) 0.2001   0.4473
## Number of obs: 3142, groups:  TRTID10, 635

## Fixed effects:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        -2.876013   0.020602 -139.60   <2e-16 ***
## factor(decade)1980  0.092597   0.007602   12.18   <2e-16 ***
## factor(decade)1990  0.903543   0.007045  128.26   <2e-16 ***
## factor(decade)2000  0.854821   0.006913  123.65   <2e-16 ***
## factor(decade)2010  0.986126   0.006723  146.67   <2e-16 ***
## P_NONWHT           -0.125500   0.014007   -8.96   <2e-16 ***
## a_hinc             -0.107335   0.001480  -72.52   <2e-16 ***
## P_NONWHT:a_hinc     0.160937   0.003117   51.64   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

summary(nbglmm)
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  TRTID10 (Intercept) 0.09073  0.3012
## Number of obs: 3142, groups:  TRTID10, 635

## Fixed effects:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        -2.797861   0.056214  -49.77  < 2e-16 ***
## factor(decade)1980  0.118588   0.039589    3.00  0.00274 **
## factor(decade)1990  0.903440   0.038255   23.62  < 2e-16 ***
## factor(decade)2000  0.843949   0.038172   22.11  < 2e-16 ***
## factor(decade)2010  1.068025   0.037376   28.58  < 2e-16 ***
## P_NONWHT            0.020012   0.089224    0.22  0.82253
## a_hinc             -0.129094   0.008109  -15.92  < 2e-16 ***
## P_NONWHT:a_hinc     0.149223   0.018967    7.87 3.61e-15 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Notice how the coefficient estimates are all very similar, the main difference being only the significance of one of your covariates, as well as the difference in the random effects variance, which suggests that the unit-level variance captured by the overdispersion parameter in the nb model (the theta value in the glmer.nb object) captures some of the between tract variance captured by the random effects.

Regarding exponentiated coefficients (and associated confidence intervals), you can use the following:

fixed <- fixef(nbglmm)
confnitfixed <- confint(nbglmm, parm = "beta_", method = "Wald") # Beware: The Wald method is less accurate but much, much faster.

# The exponentiated coefficients are also known as Incidence Rate Ratios (IRR)
IRR <- exp(cbind(fixed, confintfixed)
IRR
##                         fixed      2.5 %     97.5 %
## (Intercept)        0.06094028 0.05458271 0.06803835
## factor(decade)1980 1.12590641 1.04184825 1.21674652
## factor(decade)1990 2.46807856 2.28979339 2.66024515
## factor(decade)2000 2.32553168 2.15789585 2.50619029
## factor(decade)2010 2.90962703 2.70410073 3.13077444
## P_NONWHT           1.02021383 0.85653208 1.21517487
## a_hinc             0.87889172 0.86503341 0.89297205
## P_NONWHT:a_hinc    1.16093170 1.11856742 1.20490048

Final thoughts, regarding zero inflation. There is no multilevel implementation (at least that I am aware of) of a zero inflated poisson or negbin model that allows you to specify an equation for zero inflated component of the mixture. the glmmADMB model lets you estimate a constant zero inflation parameter. An alternative approach would be to use the zeroinfl function in the pscl package, though this does not support multilevel models. Thus, you could compare the fit of a single level negative binomial, to the single level zero inflated negative binomial. Chances are that if zero inflation is not significant for single level models, it is likely that it would not be significant for the multilevel specification.

Addendum

If you are concerned about spatial autocorrelation, you could control for this using some form of geographical weighted regression (though I believe this uses point data, not areas). Alternatively, you could group your census tracts by an additional grouping factor (states, counties) and include this as a random effect. Lastly, and I am not sure if this is entirely feasible, it may be possible to incorporate spatial dependence using, for example, the average count of R_VAC in first order neighbours as a covariate. In any case, prior to such approaches, it would be sensible to determine if spatial autocorrelation is indeed present (using Global Moran's I, LISA tests, and similar approaches).

Best Answer

Related Solutions

Solved – Understanding over-dispersion as it relates to the Poisson and the Neg. Binomial

Solved – Help interpreting count data GLMM using lme4 glmer and glmer.nb – Negative binomial versus Poisson

Addendum

Related Question