Zero-Inflation – Difference Between Zero-Inflated and Hurdle Models Explained

Is there a clear-cut difference between so-called zero-inflated distributions (models) and so-called hurdle-at-zero distributions (models)? The terms occur quite often in the literature, and I suspect they are not the same. Could you please explain the difference in simple terms?

Best Answer

Thank you for the interesting question!

Difference: One limitation of standard count models is that the zeros and the nonzeros (positives) are assumed to come from the same data-generating process. With hurdle models, these two processes are not constrained to be the same. The basic idea is that a Bernoulli probability governs the binary outcome of whether a count variate has a zero or a positive realization. If the realization is positive, the hurdle is crossed, and the conditional distribution of the positives is governed by a truncated-at-zero count data model. With zero-inflated models, the response variable is modelled as a mixture of a point mass at zero and a Poisson distribution (or any other count distribution supported on the non-negative integers), with a Bernoulli probability deciding which component generates an observation. Zeros can therefore arise from both components of a zero-inflated model, whereas in a hurdle model every zero comes from the binary part. For more detail and formulae, see, for example, Gurmu and Trivedi (1996) and Dalrymple, Hudson, and Ford (2003).
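To make the distinction concrete, here is a minimal sketch of the two probability mass functions, assuming a Poisson count part in both cases. The function names and the parameter values (pi, p_zero, lam) are illustrative choices of mine, not taken from the cited papers.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Ordinary Poisson probability P(Y = k)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k: int, pi: float, lam: float) -> float:
    """Zero-inflated Poisson: point mass at zero (weight pi) mixed with Poisson(lam).

    Zeros come from two sources: the point mass and the Poisson component.
    """
    structural_zero = pi if k == 0 else 0.0
    return structural_zero + (1.0 - pi) * poisson_pmf(k, lam)


def hurdle_pmf(k: int, p_zero: float, lam: float) -> float:
    """Hurdle Poisson: P(Y = 0) = p_zero; positives follow a zero-truncated Poisson(lam).

    Zeros come from one source only: the binary (hurdle) part.
    """
    if k == 0:
        return p_zero
    truncated = poisson_pmf(k, lam) / (1.0 - math.exp(-lam))
    return (1.0 - p_zero) * truncated


if __name__ == "__main__":
    # Assumed illustrative values: same binary probability, same Poisson rate.
    pi, p_zero, lam = 0.3, 0.3, 2.0
    print("ZIP    P(Y=0):", zip_pmf(0, pi, lam))
    print("Hurdle P(Y=0):", hurdle_pmf(0, p_zero, lam))
```

With the same binary probability, the zero-inflated model puts more mass on zero than the hurdle model, because the Poisson component contributes additional zeros of its own.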

Example: Hurdle models can be motivated by the sequential decisions that individuals face. You first decide whether you need to buy something, and then you decide on the quantity of that something (which must be positive). A zero-inflated model is appropriate when you can still end up buying nothing after deciding to buy. Zeros may then come from two sources: a) no decision to buy; b) a decision to buy that ended in buying nothing (e.g. the item was out of stock).
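The buying story can be simulated in a few lines. This is only a sketch under assumed values: the purchase probability p_buy, the Poisson rate lam, and all function names are hypothetical, and the Poisson sampler is a simple textbook method that is adequate for small rates.

```python
import math
import random


def sample_poisson(rng: random.Random, lam: float) -> int:
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_truncated_poisson(rng: random.Random, lam: float) -> int:
    """Zero-truncated Poisson by rejection: redraw until the value is positive."""
    while True:
        y = sample_poisson(rng, lam)
        if y > 0:
            return y


def simulate_hurdle(rng: random.Random, p_buy: float, lam: float) -> int:
    """Decide whether to buy; a buyer always purchases at least one unit."""
    if rng.random() >= p_buy:
        return 0
    return sample_truncated_poisson(rng, lam)


def simulate_zero_inflated(rng: random.Random, p_buy: float, lam: float) -> int:
    """Decide whether to buy; a would-be buyer may still end up with zero units."""
    if rng.random() >= p_buy:
        return 0
    return sample_poisson(rng, lam)


def zero_share(draws: list) -> float:
    """Fraction of observations equal to zero."""
    return sum(1 for y in draws if y == 0) / len(draws)


rng = random.Random(0)
p_buy, lam, n = 0.6, 1.5, 50_000  # assumed illustrative values
hurdle_draws = [simulate_hurdle(rng, p_buy, lam) for _ in range(n)]
zi_draws = [simulate_zero_inflated(rng, p_buy, lam) for _ in range(n)]

print("share of zeros, hurdle:       ", zero_share(hurdle_draws))
print("share of zeros, zero-inflated:", zero_share(zi_draws))
```

In the hurdle simulation the share of zeros should be close to 1 - p_buy, since only non-buyers produce zeros. In the zero-inflated simulation it should be close to (1 - p_buy) + p_buy * exp(-lam), since some would-be buyers also end up with nothing.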

Beta: The hurdle model is a special case of the two-part model described in Chapter 16 of Frees (2011). As shown there, in two-part models the amount of health care utilized may be a continuous as well as a count variable. So what has been somewhat confusingly termed the "zero-inflated beta distribution" in the literature in fact belongs to the class of two-part distributions and models (so common in actuarial science), which is consistent with the above definition of a hurdle model. This excellent book discusses zero-inflated models in Section 12.4.1 and hurdle models in Section 12.4.2, with formulas and examples from actuarial applications.

History: Zero-inflated Poisson (ZIP) models without covariates have a long history (see, e.g., Johnson and Kotz, 1969). The general form of ZIP regression models incorporating covariates is due to Lambert (1992). Hurdle models were first proposed by the Canadian statistician Cragg (1971) and later developed further by Mullahy (1986). You may also consider Croston (1972), where positive geometric counts are used together with a Bernoulli process to describe an integer-valued process dominated by zeros.

R: Finally, if you use R, the package pscl ("Classes and Methods for R developed in the Political Science Computational Laboratory") by Simon Jackman contains the hurdle() and zeroinfl() functions by Achim Zeileis.

The following references have been consulted to produce the above:

  • Gurmu, S. & Trivedi, P. K. Excess Zeros in Count Models for Recreational Trips. Journal of Business & Economic Statistics, 1996, 14, 469-477.
  • Johnson, N. & Kotz, S. Distributions in Statistics: Discrete Distributions. Houghton Mifflin, Boston, 1969.
  • Lambert, D. Zero-inflated Poisson regression with an application to defects in manufacturing. Technometrics, 1992, 34 (1), 1-14.
  • Cragg, J. G. Some Statistical Models for Limited Dependent Variables with Application to the Demand for Durable Goods. Econometrica, 1971, 39, 829-844.
  • Mullahy, J. Specification and testing of some modified count data models. Journal of Econometrics, 1986, 33, 341-365.
  • Frees, E. W. Regression Modeling with Actuarial and Financial Applications. Cambridge University Press, 2011.
  • Dalrymple, M. L.; Hudson, I. L. & Ford, R. P. K. Finite Mixture, Zero-inflated Poisson and Hurdle models with application to SIDS. Computational Statistics & Data Analysis, 2003, 41, 491-504.
  • Croston, J. D. Forecasting and Stock Control for Intermittent Demands. Operational Research Quarterly, 1972, 23, 289-303.