Solved – Zero-inflation on steroids: choose among Poisson, negative binomial and zero-inflated regressions

negative-binomial-distributionpoisson distributionrregressionzero inflation

I am struggling to fit alternative count models into my data. I guess my problem is just too many zeros.

This is my data

> summary(smpl)
    response        predict1          predict2        
 Min.   :0.000   Min.   :   1.00   Min.   :    22005  
 1st Qu.:0.000   1st Qu.:   3.00   1st Qu.:  4669705  
 Median :0.000   Median :   8.00   Median : 12540318  
 Mean   :0.017   Mean   :  23.27   Mean   : 20382574  
 3rd Qu.:0.000   3rd Qu.:  20.00   3rd Qu.: 25468156  
 Max.   :3.000   Max.   :1584.00   Max.   :145348049

> table(smpl$response)
  0   1   2   3 
987  10   2   1 

I tried three regressions: basic Poisson, negative binomial and zero-inflated but the only formula returning coefficients without warnings is the Poisson:

> summary(glm(response ~ ., data = smpl, family = poisson))

Call:
glm(formula = response ~ ., family = poisson, data = smpl)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3871  -0.2214  -0.1722  -0.1148   4.7861  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.472e+00  3.521e-01  -9.862  < 2e-16 ***
predict1     3.229e-03  7.271e-04   4.442 8.93e-06 ***
predict2    -6.258e-08  3.060e-08  -2.045   0.0409 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 150.67  on 999  degrees of freedom
Residual deviance: 135.84  on 997  degrees of freedom
AIC: 170.06

Number of Fisher Scoring iterations: 8

The negative binomial returns a warnings on both the convergence and the alternation limit

summary(glm.nb(response ~ ., data = smpl))

Call:
glm.nb(formula = response ~ ., data = smpl, init.theta = 0.04901296596, 
    link = log)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.28844  -0.17677  -0.14542  -0.09808   2.38314  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.899e+00  4.587e-01  -8.499  < 2e-16 ***
predict1     1.226e-02  2.144e-03   5.720 1.06e-08 ***
predict2    -5.982e-08  3.407e-08  -1.756   0.0791 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.049) family taken to be 1)

    Null deviance: 69.927  on 999  degrees of freedom
Residual deviance: 55.940  on 997  degrees of freedom
AIC: 152.37

Number of Fisher Scoring iterations: 1


              Theta:  0.0490 
          Std. Err.:  0.0251 
Warning while fitting theta: alternation limit reached 

 2 x log-likelihood:  -144.3700 
Warning messages:
1: glm.fit: algorithm did not converge 
2: In glm.nb(response ~ ., data = smpl) : alternation limit reached

and the zero-inflated (from the pscl package) doesn't return anything at all

> summary(zeroinfl(response ~ ., data = smpl, dist = "negbin"))

Call:
zeroinfl(formula = response ~ ., data = smpl, dist = "negbin")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-0.45252 -0.08817 -0.05515 -0.04210 19.56118 

Count model coefficients (negbin with log link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.477e+00         NA      NA       NA
predict1     2.678e-03         NA      NA       NA
predict2    -1.160e-07         NA      NA       NA
Log(theta)  -1.241e+00         NA      NA       NA

Zero-inflation model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.869e+00         NA      NA       NA
predict1    -1.329e-01         NA      NA       NA
predict2    -1.346e-07         NA      NA       NA
Error in if (getOption("show.signif.stars") & any(rbind(x$coefficients$count,  : 
  missing value where TRUE/FALSE needed

Then my questions are:

  1. Is there anything I can do in terms of "formula tweaking" with the negative binomial (to avoid the warnings) and with the zero-inflated (to get the coefficients)?
  2. Looking only at the results above (thus including problems with convergence and alternation limit) should I select the negative binomial model since it seems, looking at the AIC, to fit better than the Poisson in my data?

Best Answer

You have 13 observations that are non-zero and only 3 observations that are greater than 1. Hence, I believe the best thing you can do is to use a binary regression only (0 vs > 0). Possibly it could help to add some form of regularization for that as well, e.g. bias reduction (see package brglm), because non-zeros are so rare.

Furthermore, I would recommend to bring the predict variables to a similar scale, e.g., dividing predict2 by 1e6 or so. Otherwise, numerical optimization algorithms have problems. This might improve convergence for glm.nb and zeroinfl (which by the way does not return nothing - there are coefficients but the covariance matrix estimate does not seem valid).