Zero-Inflation – Difference Between Zero-Inflated and Hurdle Models Explained

Is there a clear-cut difference between the so-called zero-inflated distributions (models) and the so-called hurdle-at-zero distributions (models)? Both terms occur quite often in the literature and I suspect they are not the same. Could you please explain the difference in simple terms?

Best Answer

Thank you for the interesting question!

Difference: One limitation of standard count models is that the zeros and the nonzeros (positives) are assumed to come from the same data-generating process. With hurdle models, these two processes are not constrained to be the same. The basic idea is that a Bernoulli probability governs the binary outcome of whether a count variate has a zero or a positive realization. If the realization is positive, the hurdle is crossed, and the conditional distribution of the positives is governed by a truncated-at-zero count data model. With zero-inflated models, the response variable is modelled as a mixture of a point mass at zero (selected with a Bernoulli probability) and a Poisson distribution (or any other count distribution supported on the non-negative integers), so zeros can arise from either component. For more detail and formulae, see, for example, Gurmu and Trivedi (1996) and Dalrymple, Hudson, and Ford (2003).
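To make the contrast concrete, the two probability mass functions can be written side by side. This is a sketch in notation chosen here, not taken from the cited papers: $f$ is the base count distribution (e.g. Poisson), $p$ is the probability of crossing the hurdle, and $\omega$ is the probability of the point mass at zero.

```latex
% Hurdle model: all zeros come from the Bernoulli part,
% positives come from the zero-truncated count distribution.
\[
P(Y = y) =
\begin{cases}
1 - p, & y = 0, \\[4pt]
p \, \dfrac{f(y)}{1 - f(0)}, & y = 1, 2, \dots
\end{cases}
\]

% Zero-inflated model: zeros come from the point mass
% and from the (untruncated) count distribution.
\[
P(Y = y) =
\begin{cases}
\omega + (1 - \omega)\, f(0), & y = 0, \\[4pt]
(1 - \omega)\, f(y), & y = 1, 2, \dots
\end{cases}
\]
```

In the hurdle model the zero probability is $1 - p$ and can be smaller or larger than $f(0)$. In the zero-inflated model with $0 \le \omega \le 1$ the zero probability is never below $f(0)$.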

Example: Hurdle models can be motivated by the sequential decisions individuals face. You first decide whether you need to buy something, and then you decide on the quantity of that something (which must be positive). A zero-inflated model is appropriate when you can still end up buying nothing after deciding to buy. Zeros may then come from two sources: a) no decision to buy; b) a decision to buy that ended with nothing bought (e.g. the item was out of stock).
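The buying example can be simulated in base R to see where the zeros come from in each model. This is only a sketch under stated assumptions: `p_buy`, `lambda`, and the helper `rtpois()` are hypothetical names and values introduced for illustration, and a Poisson count distribution is assumed.

```r
set.seed(1)
n      <- 10000
p_buy  <- 0.6   # assumed probability of deciding to buy
lambda <- 1.5   # assumed mean of the Poisson quantity

# Hypothetical helper: draw from a zero-truncated Poisson by
# inverse-CDF sampling restricted to counts >= 1.
rtpois <- function(n, lambda) {
  u <- runif(n, min = dpois(0, lambda), max = 1)
  qpois(u, lambda)
}

decide <- rbinom(n, 1, p_buy)  # 1 = decided to buy

# Hurdle: a buyer always purchases at least one unit.
y_hurdle <- ifelse(decide == 1, rtpois(n, lambda), 0)

# Zero-inflated: a buyer may still end up with zero units.
y_zi <- ifelse(decide == 1, rpois(n, lambda), 0)

# Share of zeros in each simulated sample
mean(y_hurdle == 0)  # close to 1 - p_buy
mean(y_zi == 0)      # close to (1 - p_buy) + p_buy * exp(-lambda)
```

In the hurdle sample every zero is a "no decision to buy" zero, while the zero-inflated sample contains additional zeros from buyers whose Poisson draw happened to be zero.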

Beta: The hurdle model is a special case of the two-part model described in Chapter 16 of Frees (2011). As shown there, in two-part models the amount of health care utilized may be a continuous as well as a count variable. So what has been somewhat confusingly termed the "zero-inflated beta distribution" in the literature in fact belongs to the class of two-part distributions and models (so common in actuarial science), which is consistent with the above definition of a hurdle model. This excellent book discusses zero-inflated models in section 12.4.1 and hurdle models in section 12.4.2, with formulas and examples from actuarial applications.

History: Zero-inflated Poisson (ZIP) models without covariates have a long history (see e.g. Johnson and Kotz, 1969). The general form of ZIP regression models incorporating covariates is due to Lambert (1992). Hurdle models were first proposed by the Canadian statistician Cragg (1971) and later developed further by Mullahy (1986). You may also consider Croston (1972), where positive geometric counts are used together with a Bernoulli process to describe an integer-valued process dominated by zeros.

R: Finally, if you use R, there is the pscl package ("Classes and Methods for R developed in the Political Science Computational Laboratory") by Simon Jackman, which contains the hurdle() and zeroinfl() functions by Achim Zeileis.
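A minimal sketch of fitting both model types with pscl might look like the following. The choice of the bioChemists data set (shipped with pscl) and of the covariates fem and ment is an assumption made for illustration only; it is not part of the answer above.

```r
library(pscl)

# Illustrative data: article counts of biochemistry PhD students
data("bioChemists", package = "pscl")

# Formula syntax: count-model covariates | zero-model covariates
fm_hurdle <- hurdle(art ~ fem + ment | fem + ment,
                    data = bioChemists,
                    dist = "poisson", zero.dist = "binomial")

fm_zip <- zeroinfl(art ~ fem + ment | fem + ment,
                   data = bioChemists,
                   dist = "poisson")

summary(fm_hurdle)
summary(fm_zip)

# Compare the two fits on the same data
AIC(fm_hurdle, fm_zip)
```

When reading the output, note that the zero part means different things in the two fits: in hurdle() the binomial part models the probability of a positive count (crossing the hurdle), whereas in zeroinfl() it models the probability of an excess zero, so the coefficients of the two zero parts typically have opposite signs.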

The following references have been consulted to produce the above:

Gurmu, S. & Trivedi, P. K. Excess Zeros in Count Models for Recreational Trips. Journal of Business & Economic Statistics, 1996, 14, 469-477.
Johnson, N. & Kotz, S. Distributions in Statistics: Discrete Distributions. Houghton Mifflin, Boston, 1969.
Lambert, D. Zero-inflated Poisson regression with an application to defects in manufacturing. Technometrics, 1992, 34 (1), 1-14.
Cragg, J. G. Some Statistical Models for Limited Dependent Variables with Application to the Demand for Durable Goods. Econometrica, 1971, 39, 829-844.
Mullahy, J. Specification and testing of some modified count data models. Journal of Econometrics, 1986, 33, 341-365.
Frees, E. W. Regression Modeling with Actuarial and Financial Applications. Cambridge University Press, 2011.
Dalrymple, M. L.; Hudson, I. L. & Ford, R. P. K. Finite Mixture, Zero-inflated Poisson and Hurdle models with application to SIDS. Computational Statistics & Data Analysis, 2003, 41, 491-504.
Croston, J. D. Forecasting and Stock Control for Intermittent Demands. Operational Research Quarterly, 1972, 23, 289-303.

Related Solutions

Solved – Zero inflated models – “true zero” vs. “excess zero”

I only know what I've read, but I believe the difference is that excess zeros are zeros where there could not be any events, while true zeros occur where there could have been an event, but there was none. For example, people coming into a bank: during business hours, there might be a period of time when zero customers entered the bank (true zero), but when the bank is closed, you will always get zeros (excess zeros) and since the bank is closed more than it is open you will get a lot of excess zeros.

Solved – Zero-inflated count models in R: what is the real advantage

I think this is a poorly chosen data set for exploring the advantages of zero inflated models, because, as you note, there isn't that much zero inflation.

plot(fitted(fm_pois), fitted(fm_zinb))

shows that the predicted values are almost identical.

In data sets with more zero-inflation, the ZI models give different (and usually better fitting) results than Poisson.

Another way to compare the fit of the models is to compare the size of residuals:

# Negative values mean the Poisson residual is smaller in magnitude
boxplot(abs(resid(fm_pois)) - abs(resid(fm_zinb)))

shows that, even here, the residuals from the Poisson are smaller than those from the ZINB. If you have some idea of a magnitude of the residual that is really problematic, you can see what proportion of the residuals in each model are above that. E.g. if being off by more than 1 was unacceptable

# Count residuals whose absolute value exceeds 1
sum(abs(resid(fm_pois)) > 1)
sum(abs(resid(fm_zinb)) > 1)

shows the latter is a bit better - 20 fewer large residuals.

Then the question is whether the added complexity of the models is worth it to you.
