Exact equations for predictions in the hurdle model

generalized linear modelzero inflation

I am using the pscl package in R to fit the hurdle model for count data (default specification: binomial for zero hurdle part and Poisson for the truncated count part). The built-in predict(type = 'response') method for the model gives me the expected count for a given set of predictors. However, I am not able to reproduce the result by manual calculation. Please help me with the exact equations based on the estimated coefficients, so I may better understand it.

My attempt: Before fitting a hurdle model, I was using a zero-inflated Poisson model from the same package, and I was able to crack the code by undermentioned equations. I've taken different argument names $x$ and $z$ to denote the different sets of predictors for each part of the model, and likewise for the corresponding coefficients $\beta$ and $\gamma$).

  1. $\log{\lambda(x)} = \beta_{0}x_{0}+\beta_{1}x_{1}+\dots+\beta_{n}x_{n} = x^\top \beta$ for the count part, gives me $\lambda$.

  2. $\log{\dfrac{\pi(z)}{1-\pi(z)}} = \gamma_{0}z_{0}+\gamma_{1}z_{1}+\dots+\gamma_{m}z_{m} = z^\top \gamma$ for the binomial part, gives me $\pi$.

  3. Then I calculated $(1-\pi)\lambda$ to be the expected count for that observation.

This worked out well, giving me exact answers as the built-in predict method. Now hurdle model varies only a bit, where the expected value is coming out to be $(1-\pi)\frac{\lambda}{1-\exp(-\lambda)}$ according to my calculation. But, when I check the outputs using predict output (0.02826577) and this manual output (1.296293) for the all 0 input, the answers are not the same. I guess the fitting part equation itself is changing. Please help me with the equation.

Model fitting results.

Best Answer

Root of the problem: You are right about most details but the predict(..., type = "zero") does not give the zero hurdle probability $\pi$ for the hurdle model. Instead it gives the zero inflation probability.

Motivation: Our idea was that this should help to compare hurdle() and zeroinfl() outputs. Unfortunately, that did not work out well but lead to some confusion. In the countreg package (providing improved versions of the pscl count regression functionality) we have rewritten the predict() method completely and will hopefully release it soon to CRAN.

Details: The exact equations for the model specifications, the corresponding mean equations, and the predictions are provided in: Zeileis A, Kleiber C, Jackman S (2008). "Regression Models for Count Data in R." Journal of Statistical Software, 27(8). doi:10.18637/jss.v027.i08. See in particular Sections 2.2 and 2.3 for the model specifications and Appendix C for the predictions.

Illustration: To illustrate the concrete equations, let us consider a simplified hurdle model for the bioChemists data. For simplicity we just use a single regressor ment for both $x$ and $z$.

library("pscl")
data("bioChemists", package = "pscl")
m <- hurdle(art ~ ment | ment, data = bioChemists)
coef(m)
## count_(Intercept)        count_ment  zero_(Intercept)         zero_ment 
##        0.53943647        0.01870839        0.24870755        0.08092126 

For predictions let us consider two observations with ment being either 0 or 10. Using almost the notation from your question we have:

x <- z <- c(0, 10)
beta <- coef(m, model = "count")
gamma <- coef(m, model = "zero")
lambda <- exp(beta[1] + beta[2] * x)
pi <- plogis(gamma[1] + gamma[2] * z)
pi/(1 - dpois(0, lambda)) * lambda
## [1] 1.175071 1.757169

The main difference to your notation is that pi denotes the probability to cross the hurdle (and not the complementary probability of observing a zero).

Thus overall expectations are the type = "response" predictions:

nd <- data.frame(ment = x)
predict(m, newdata = nd, type = "response")
##        1        2 
## 1.175071 1.757169 

And the lambda is the type = "count" prediction:

lambda
## [1] 1.715040 2.067873
predict(m, newdata = nd, type = "count")
##        1        2 
## 1.715040 2.067873 

To obtain pi or rather 1 - pi we can use type = "prob" for at = 0:

1 - pi
## [1] 0.4381416 0.2577071
predict(m, newdata = nd, type = "prob", at = 0:4)
##           0         1         2         3          4
## 1 0.4381416 0.2114617 0.1813327 0.1036643 0.04444710
## 2 0.2577071 0.2222019 0.2297427 0.1583595 0.08186684

So far so good. The remaining question is what does type = "zero" give us? Well, it's the zero-inflation probability which is the ratio of non-zero probabilities from the zero hurdle and the count component:

pi/(1 - dpois(0, lambda))
## [1] 0.6851568 0.8497472
predict(m, newdata = nd, type = "zero")
##         1         2 
## 0.6851568 0.8497472 

The benefit of this is that you can compare predict(..., type = "zero") between the hurdle() and the zeroinfl() output. And the type = "response" prediction is just the product of the other two predictions:

predict(m, newdata = nd, type = "response")
##        1        2 
## 1.175071 1.757169 
predict(m, newdata = nd, type = "zero") * predict(m, newdata = nd, type = "count")
##        1        2 
## 1.175071 1.757169 

I thought that was a neat property that was worth bringing out in the predict() method. However, I'm afraid that it confused more pscl users than it helped...

Outlook: Watch out for when the countreg package becomes available on CRAN. It will provide more prediction types with better documentation and hopefully less confusing labels.

Related Question