Solved – When to use zero-inflated poisson regression and negative binomial distribution

generalized linear modelnegative-binomial-distributionrzero inflation

I have a fairly simple dataset looking at the relationship between the first nesting date of a bird in a given year (Date) and the birds overall fledgling production from that year (Fledge; count data from 0-3 fledglings). I want to determine the optimal laying date for this species (i.e. the date where fledgling production is highest); however, I have been struggling to work out which statistical analysis is most appropriate for my data.

Most birds produce no fledglings in a year, so there are many zero values in the data. With this in mind, I thought that a zero inflated poisson regression might be most appropriate. To test this in R, I fitted a regular glm with poisson distribution (model1 below) and a zero inflated poisson model using zeroinfl() from the pscl library (model2 below). I then compared the two using vuong test statistic (output below).

model1<-glm(Fledge~Date,data=OPT,family="poisson")
model2<-zeroinfl(Fledge~Date,data=OPT,dist="poisson")
vuong(model1,model2)

Vuong Non-Nested Hypothesis Test-Statistic: 4.25169 
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
in this case:
model1 > model2, with p-value 1.0608e-05

According to the vuong output, a regular non-zero inflated poisson regression is most appropriate. This wasn't that surprising as, from my understanding, the zeros in my data are 'true zeroes' i.e. they are legitimate data points and not due to sampling technique or design. Next I tested for overdispersion by fitting a glm with a quasipoisson distribution to check the dispersion parameter (model3).

model3<-glm(Fledge~Date,data=OPT,family="quasipoisson")
summary(model3)

Call:
glm(formula = Fledge ~ Date, family = "quasipoisson", data = OPT)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7877  -0.6258  -0.5578  -0.4504   3.3648  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -0.15109    0.54874  -0.275  0.78315   
Date        -0.03288    0.01133  -2.901  0.00385 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 1.296628)

Null deviance: 455.36  on 642  degrees of freedom
Residual deviance: 443.40  on 641  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6

As you can see, the output showed a dispersion parameter close to one (1.297), and a null deviance (455.36) very close to the residual deviance (443.40). So I concluded that overdispersion was not a big problem, and a poisson, rather than negative binomial, distribution was required.

After doing all this, I was fairly confident that my regular poisson regression (model1) was best for my data. HOWEVER, when I plotted the model outputs of the zero inflated and non-zero inflated data, the non-zero inflated model output didn't appear to fit my data at all while the zero inflated model fitted much better.

ggplot(OPT,aes(x=Date,y=Fledge))+
geom_point()+
theme_bw()+
geom_line(data=cbind(OPT,optpred=predict(model1)),aes(y=optpred),size=1,colour="red")+
geom_line(data=cbind(OPT,optpred2=predict(model2)),aes(y=optpred2),size=1,colour="blue)

enter image description here

As you can see, the non-zero inflated model (red line) doesn't seem to fit the data at all, as it is predicting fledgling production less than 0 on all dates (obviously not biologically possible)! Conversely, the zero inflated model (blue line) seems to fit the data very well. So I have two questions to ask;

  1. Were the methods I used to test zero inflation (and overdispersion) appropriate and interpreted correctly?

  2. If so, is my application of a poisson regression appropriate? Or are there other more appropriate options for my data outside of the two I've tried here?

I've included a dput() of my data below to allow for replication.

structure(list(Date = c(45L, 40L, 42L, 41L, 39L, 34L, 40L, 44L, 
36L, 32L, 33L, 89L, 58L, 50L, 46L, 56L, 48L, 69L, 64L, 56L, 61L, 
58L, 66L, 63L, 57L, 58L, 60L, 65L, 31L, 48L, 42L, 41L, 46L, 38L, 
59L, 41L, 65L, 41L, 34L, 41L, 36L, 60L, 42L, 39L, 43L, 46L, 47L, 
38L, 38L, 71L, 65L, 51L, 42L, 37L, 51L, 41L, 65L, 59L, 44L, 50L, 
51L, 47L, 40L, 53L, 56L, 62L, 50L, 46L, 51L, 55L, 50L, 46L, 45L, 
39L, 36L, 52L, 50L, 73L, 42L, 38L, 51L, 49L, 43L, 45L, 44L, 76L, 
68L, 65L, 70L, 56L, 40L, 45L, 49L, 52L, 66L, 80L, 45L, 42L, 44L, 
37L, 48L, 43L, 53L, 31L, 47L, 49L, 44L, 46L, 54L, 55L, 48L, 53L, 
55L, 72L, 54L, 45L, 83L, 59L, 48L, 47L, 52L, 72L, 51L, 70L, 48L, 
44L, 42L, 38L, 48L, 43L, 45L, 39L, 45L, 42L, 64L, 46L, 56L, 34L, 
50L, 48L, 47L, 47L, 60L, 50L, 61L, 40L, 72L, 63L, 55L, 66L, 69L, 
66L, 61L, 60L, 60L, 40L, 70L, 45L, 40L, 41L, 42L, 71L, 54L, 45L, 
52L, 48L, 40L, 39L, 49L, 42L, 43L, 53L, 38L, 53L, 52L, 68L, 61L, 
62L, 87L, 41L, 45L, 37L, 44L, 45L, 43L, 72L, 39L, 56L, 34L, 74L, 
62L, 46L, 43L, 47L, 35L, 54L, 61L, 44L, 49L, 54L, 61L, 37L, 51L, 
48L, 52L, 48L, 48L, 44L, 45L, 44L, 45L, 68L, 61L, 87L, 51L, 52L, 
50L, 50L, 56L, 55L, 56L, 57L, 65L, 41L, 63L, 76L, 52L, 62L, 50L, 
50L, 54L, 63L, 48L, 54L, 46L, 57L, 54L, 52L, 45L, 41L, 54L, 74L, 
69L, 68L, 51L, 60L, 54L, 44L, 67L, 52L, 49L, 43L, 41L, 44L, 49L, 
46L, 43L, 46L, 49L, 46L, 47L, 54L, 55L, 67L, 52L, 55L, 52L, 49L, 
50L, 51L, 57L, 48L, 34L, 54L, 49L, 47L, 71L, 62L, 43L, 45L, 45L, 
49L, 58L, 57L, 55L, 54L, 52L, 51L, 41L, 54L, 70L, 52L, 53L, 53L, 
50L, 71L, 56L, 48L, 33L, 43L, 41L, 68L, 42L, 38L, 39L, 46L, 55L, 
64L, 62L, 56L, 69L, 44L, 49L, 54L, 86L, 46L, 46L, 50L, 44L, 45L, 
55L, 55L, 52L, 49L, 49L, 56L, 41L, 34L, 50L, 62L, 39L, 41L, 56L, 
42L, 40L, 43L, 44L, 45L, 43L, 48L, 41L, 45L, 62L, 49L, 47L, 49L, 
63L, 69L, 46L, 53L, 49L, 59L, 54L, 33L, 46L, 44L, 49L, 36L, 41L, 
33L, 66L, 56L, 67L, 43L, 66L, 31L, 51L, 59L, 57L, 51L, 39L, 44L, 
31L, 40L, 39L, 42L, 27L, 43L, 42L, 78L, 60L, 70L, 64L, 67L, 66L, 
67L, 66L, 62L, 58L, 51L, 50L, 60L, 38L, 45L, 34L, 69L, 38L, 45L, 
39L, 44L, 39L, 44L, 43L, 46L, 37L, 59L, 74L, 59L, 39L, 43L, 40L, 
38L, 45L, 45L, 42L, 36L, 33L, 51L, 64L, 52L, 40L, 89L, 49L, 37L, 
51L, 70L, 65L, 71L, 62L, 61L, 68L, 59L, 54L, 75L, 57L, 55L, 58L, 
52L, 58L, 45L, 50L, 41L, 64L, 49L, 50L, 67L, 54L, 43L, 49L, 54L, 
55L, 53L, 53L, 59L, 47L, 47L, 48L, 45L, 50L, 39L, 48L, 51L, 54L, 
44L, 43L, 56L, 51L, 38L, 71L, 62L, 56L, 65L, 69L, 68L, 52L, 47L, 
47L, 47L, 80L, 51L, 48L, 36L, 32L, 39L, 45L, 31L, 43L, 57L, 65L, 
60L, 62L, 36L, 53L, 64L, 57L, 43L, 71L, 66L, 63L, 49L, 39L, 49L, 
43L, 32L, 47L, 44L, 35L, 35L, 41L, 54L, 50L, 44L, 44L, 48L, 50L, 
41L, 40L, 46L, 48L, 38L, 43L, 54L, 52L, 36L, 62L, 72L, 47L, 66L, 
50L, 51L, 50L, 56L, 47L, 67L, 50L, 35L, 40L, 43L, 42L, 31L, 35L, 
43L, 46L, 45L, 46L, 39L, 40L, 40L, 39L, 36L, 45L, 43L, 44L, 44L, 
44L, 38L, 49L, 52L, 49L, 43L, 42L, 47L, 56L, 51L, 51L, 51L, 59L, 
64L, 46L, 40L, 75L, 65L, 51L, 91L, 56L, 83L, 56L, 57L, 58L, 51L, 
50L, 56L, 40L, 69L, 54L, 45L, 35L, 41L, 48L, 60L, 54L, 39L, 39L, 
31L, 92L, 39L, 66L, 56L, 48L, 44L, 40L, 42L, 47L, 51L, 47L, 45L, 
49L, 69L, 48L, 42L, 58L, 56L, 58L, 61L, 42L, 36L, 47L, 52L, 45L, 
54L, 55L, 62L, 48L, 44L, 54L, 51L, 46L, 44L, 50L, 37L, 33L, 40L, 
57L, 54L, 64L, 59L, 69L, 46L, 40L, 51L, 53L, 79L, 60L), Fledge = c(1L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 
1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 3L, 0L, 1L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 
0L, 0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 2L, 0L, 0L, 0L, 0L, 
0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 1L, 2L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 
0L, 1L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 3L, 0L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
2L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 2L, 0L, 0L, 0L, 1L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 
1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 3L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 2L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 2L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 2L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
2L, 0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 2L, 0L, 0L, 0L, 1L, 0L, 0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 1L, 1L, 0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L)), .Names = c("Date", "Fledge"), class = "data.frame", row.names = c(NA, 
    -643L))

Best Answer

I suspect that your problem may be that the default behavior of predict.glm isn't what you think it is.

Specifically, predict used on a glm object will by default gives a response on the scale of the linear predictors, not the response.

This is quite clearly stated in the help (?predict.glm) but seems to trip people up very often (suggesting the default ought to be changed, perhaps; you might like to raise it on the relevant mailing list).

To get the values you want, try predict(model1,type="response")