Given count data with many zero observations, what is a reasonable amount of zero observations in the data?

count-data, modeling, zero-inflation

I have sales data that records the time of each sale (to the second) and how many units were sold. The data is therefore count data, with around 90-150 sales during a 3-day period. If I aggregate it to 10-minute intervals, I will have upwards of 75% zero observations, because for example 100/(24*3*6) = 0.2314815, so at most about 23% of the 432 intervals can contain a sale. I think this is still too high a ratio of zeros to total observations, even when fitting a zero-inflated Poisson or similar type of model. What is a reasonable proportion of zero observations in a dataset for applying zero-inflated Poisson-like models? According to the literature and examples elsewhere, I think 40% zero observations is acceptable, but I don't know whether 50% is okay.

Best Answer

I am not sure there are any hard and fast rules about an acceptable number of zeros, particularly when working with a zero-inflated model. Zero-inflated models have two parts: a count model, and a binary model that predicts whether an observation is an excess zero (one that is zero regardless of the count process). The binary part gives the probability

$$ P(\text{excess zero} \mid x_{i}) = p_{i} = \frac{1}{1 + e^{-x_{i}^{'}b}} $$

This is typically done with a logistic model, although probit is also not uncommon. Note that this is a conditional probability, conditional on some vector $x_i$, which in the simplest case is just the scalar $1$ (an intercept-only model) but could contain information that predicts whether the outcome $y_i$ is an excess zero or not. The more accurately you can predict this, the better the adjustment to your count model. This adjustment is evident in the log likelihood function for the zero-inflated Poisson:

$$ \mathcal{L} = \sum_{i=1}^{n}\left\{ \begin{array}{rl} \ln\left(p_{i} + (1 - p_{i})e^{-\mu_{i}}\right) &\mbox{if $y_{i} = 0$} \\ y_{i}\ln(\mu_{i}) + \ln(1 - p_{i}) - \mu_{i} - \ln(y_{i}!) &\mbox{if $y_{i} > 0$} \end{array} \right. $$

where $\mu_i = e^{x_{i}^{'}\beta}$ is the expected count given your model (I assume the canonical log link). In particular, consider the behavior as $p_i$ goes to the extremes, 0 or 1. For $y_i > 0$, as $p_i \to 0$ the term $\ln(1 - p_i)$ vanishes and the contribution of observation $i$ converges to the ordinary Poisson log likelihood:

$$ \mathcal{L}_{i} = y_{i}\ln(\mu_{i}) - \mu_{i} - \ln(y_{i}!) $$

As $p_i \to 1$, $\ln(1 - p_i) \to -\infty$, so a positive count becomes impossible for an observation that is predicted to be a certain excess zero.
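To make the zero-inflated Poisson log likelihood above concrete, here is a minimal sketch in plain Python. It treats $p_i$ as the probability of an excess zero, which is how it enters the likelihood. The function names `zip_loglik` and `poisson_loglik` and the small data set are made up for illustration, and the means and probabilities are supplied directly rather than estimated from covariates.

```python
import math

def zip_loglik(y, mu, p):
    """Zero-inflated Poisson log likelihood, summed over observations.

    y  : observed counts (non-negative integers)
    mu : Poisson means, one per observation (mu_i > 0)
    p  : excess-zero probabilities, one per observation (0 <= p_i < 1)
    """
    total = 0.0
    for y_i, mu_i, p_i in zip(y, mu, p):
        if y_i == 0:
            # A zero is either an excess zero or an ordinary Poisson zero.
            total += math.log(p_i + (1.0 - p_i) * math.exp(-mu_i))
        else:
            # A positive count can only come from the Poisson part.
            total += (y_i * math.log(mu_i) + math.log(1.0 - p_i)
                      - mu_i - math.lgamma(y_i + 1))
    return total

def poisson_loglik(y, mu):
    """Ordinary Poisson log likelihood, for comparison."""
    return sum(y_i * math.log(mu_i) - mu_i - math.lgamma(y_i + 1)
               for y_i, mu_i in zip(y, mu))

# Hypothetical data: five observations sharing an assumed mean of 0.8.
y = [0, 0, 1, 3, 0]
mu = [0.8] * len(y)

print(zip_loglik(y, mu, [0.0] * len(y)))  # no inflation
print(poisson_loglik(y, mu))              # matches the line above up to rounding
print(zip_loglik(y, mu, [0.4] * len(y)))  # with an assumed 40% excess-zero probability
```

With every $p_i$ set to 0 the two functions agree, which is the limiting behavior described above. `math.lgamma(y_i + 1)` is used for $\ln(y_i!)$ so that large counts do not overflow.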

More pragmatically, one concern is whether you have enough data that are not zero. Estimates will be unstable given insufficient data. For example, 250 observations may be great, but if 240 are zeros, then even if you can perfectly predict zero versus nonzero, you still only have 10 observations about the actual count distribution. In addition, you could check the distribution of the residuals and the residuals versus fitted values. In particular, if you are concerned about the number of zeros being an issue, check the residuals and the fit for the zero values.
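As a rough first check along these lines, the sketch below counts the nonzero observations and compares the observed share of zeros with the share a plain Poisson model with the same mean would expect. The helper name `zero_summary` and the example counts are hypothetical, and the comparison ignores covariates, so it is only a crude diagnostic and not a substitute for inspecting residuals from a fitted model.

```python
import math

def zero_summary(counts):
    """Summarize how many zeros there are and how many a Poisson would expect."""
    n = len(counts)
    n_nonzero = sum(1 for c in counts if c > 0)
    mean = sum(counts) / n
    return {
        "n": n,
        "n_nonzero": n_nonzero,                     # data informing the count part
        "observed_zero_frac": (n - n_nonzero) / n,
        "poisson_zero_frac": math.exp(-mean),       # P(Y = 0) for Poisson(mean)
    }

# Hypothetical data shaped like the example above: 250 observations, 240 zeros.
counts = [0] * 240 + [1] * 6 + [2] * 3 + [4] * 1
for name, value in zero_summary(counts).items():
    print(name, value)
```

If the mean count per 10-minute interval were about 100/432 ≈ 0.23, as in the question, a plain Poisson model would already expect roughly 79% zeros, so a high share of zeros does not by itself indicate zero inflation.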

If your model is not fitting the zeros or the count data well, you may want to consider some other form of model. One common alternative to the zero-inflated Poisson is the zero-inflated negative binomial. The primary difference is an overdispersion parameter, $\alpha$ (although the log likelihood function is rather more complex): $$ \mathcal{L} = \sum_{i=1}^{n} \left\{ \begin{array}{rl} \ln\left(p_{i} + (1 - p_{i})\left(\frac{1}{1 + \alpha\mu_{i}}\right)^{\frac{1}{\alpha}}\right) &\mbox{if $y_{i} = 0$} \\ \ln(1 - p_{i}) + \ln\Gamma\left(\frac{1}{\alpha} + y_{i}\right) - \ln\Gamma(y_{i} + 1) - \ln\Gamma\left(\frac{1}{\alpha}\right) + \left(\frac{1}{\alpha}\right)\ln\left(\frac{1}{1 + \alpha\mu_{i}}\right) + y_{i}\ln\left(1 - \frac{1}{1 + \alpha\mu_{i}}\right) &\mbox{if $y_{i} > 0$} \end{array} \right. $$
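A minimal sketch of the zero-inflated negative binomial log likelihood follows. It is written so that the excess-zero probability $p_i$ is mixed with the negative binomial in the same way as in the ZIP formula, with mean $\mu_i$ and overdispersion $\alpha > 0$. The function name `zinb_loglik` and the numbers in the example are illustrative assumptions, not output from a fitted model.

```python
import math

def zinb_loglik(y, mu, p, alpha):
    """Zero-inflated negative binomial log likelihood, summed over observations.

    y     : observed counts (non-negative integers)
    mu    : negative binomial means, one per observation (mu_i > 0)
    p     : excess-zero probabilities, one per observation (0 <= p_i < 1)
    alpha : overdispersion parameter (alpha > 0)
    """
    inv_alpha = 1.0 / alpha
    total = 0.0
    for y_i, mu_i, p_i in zip(y, mu, p):
        # log(1 / (1 + alpha * mu_i)); log1p keeps accuracy when alpha is small
        log_q = -math.log1p(alpha * mu_i)
        if y_i == 0:
            # A zero is either an excess zero or a negative binomial zero.
            total += math.log(p_i + (1.0 - p_i) * math.exp(inv_alpha * log_q))
        else:
            total += (math.log(1.0 - p_i)
                      + math.lgamma(inv_alpha + y_i)
                      - math.lgamma(y_i + 1)
                      - math.lgamma(inv_alpha)
                      + inv_alpha * log_q
                      # log(1 - 1 / (1 + alpha * mu_i)), computed via expm1
                      + y_i * math.log(-math.expm1(log_q)))
    return total

# For the same assumed mean, the negative binomial puts more mass on zero than the Poisson.
mu_example, alpha_example = 0.8, 1.0
print(math.exp(-mu_example))                                         # Poisson P(Y = 0)
print((1.0 / (1.0 + alpha_example * mu_example)) ** (1.0 / alpha_example))  # NB P(Y = 0)
```

As $\alpha$ shrinks toward 0 the negative binomial part approaches the Poisson, so with a very small `alpha` this function gives nearly the same value as the ZIP log likelihood.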

You could also explore mixture models that assume observed data comes from an underlying mixture of distributions.

Here are some pages that may be helpful either for fitting models, talking about them, or graphing. For full transparency, I was the primary author of those pages. I am sure there are other good resources, but I linked those because I know them off the top of my head.

Zero-inflated Poisson; zero-inflated negative binomial; zero-truncated Poisson. The last of these is more for its different graphing approaches than as an actual model suggestion, although you could try a zero-truncated model on just the nonzero observations (i.e., exclude all zeros and see how that compares with the ZIP).
