Solved – Given count data with many zero observations, what is a reasonable amount of zero observations in the data

count-data, modeling, zero-inflation

I have sales data that record the time of each sale (to the second) and how many units were sold. The data are therefore count data, with around 90-150 sales during a 3-day period. If I aggregate to 10-minute intervals, I will have around 75% zero observations because, for example, 100/(24*3*6) = 0.2314815 sales per interval on average. I think this is still too high a ratio of zeros to total observations, even for zero-inflated Poisson or similar types of models. What is a reasonable share of zero observations in a dataset for applying zero-inflated Poisson-like models? According to the literature and examples elsewhere, I think 40% zero observations is acceptable, but I don't know whether 50% is okay.

Best Answer

I am not sure there are any hard and fast rules about an acceptable number of zeros, particularly when working with a zero-inflated model. Zero-inflated models have two parts. One part predicts the probability that an observation is an excess zero, that is

$$ P(\text{excess zero} \mid x_{i}) = p_{i} = \frac{1}{1 + e^{-x_{i}^{'}b}} $$

This is typically done with a logistic model, although probit is also not uncommon. Note that this is a conditional probability, conditional on some vector $x_i$, which in the simplest case is the scalar $1$ (an intercept only) but could contain information that predicts whether the outcome $y_i$ is a zero or not. The other part is the count model itself. The more accurately you can predict whether an observation will be zero or not, the better the adjustment to your count model. This adjustment is evident in the log likelihood function for the zero-inflated Poisson:

$$ \mathcal{L} = \sum_{i=1}^{n}\left\{ \begin{array}{rl} \ln(p_{i} + (1 - p_{i})e^{-\mu_{i}}) &\mbox{if $y_{i} = 0$} \\ y_{i}\ln(\mu_{i}) + \ln(1 - p_{i}) - \mu_{i} - \ln(y_{i}!) &\mbox{if $y_{i} > 0$} \end{array} \right. $$

where $\mu_i = e^{x_{i}^{'}\beta}$ is the expected count given your model (I assume the canonical log link). In particular, consider the behavior as $p_i$ goes to the extremes, 1 or 0. As $p_i \to 0$, the contribution of an observation with $y_i > 0$ converges to the ordinary Poisson term:

$$ \mathcal{L}_i = y_{i}\ln(\mu_{i}) - \mu_{i} - \ln(y_{i}!) $$

As $p_i \to 1$, $\ln(1 - p_i) \to -\infty$, so a positive count becomes essentially impossible under the model.
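As a minimal sketch of the zero-inflated Poisson log likelihood above, the following Python evaluates it for a small made-up sample and checks the limiting behavior as $p_i$ goes to 0. It follows the likelihood as written, so `p` is treated as the probability of an excess zero. The function names, the toy counts, and the values of `mu` and `p` are all illustrative assumptions, not fitted quantities.

```python
import math

def poisson_logpmf(y, mu):
    """Log of the Poisson pmf: y*ln(mu) - mu - ln(y!)."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def zip_loglik_i(y, mu, p):
    """Log likelihood contribution of one observation under a ZIP model.

    p  : probability of an excess (structural) zero
    mu : mean of the Poisson count component
    """
    if y == 0:
        # A zero can come from the excess-zero part or from the Poisson part.
        return math.log(p + (1.0 - p) * math.exp(-mu))
    # A positive count can only come from the Poisson part.
    return math.log(1.0 - p) + poisson_logpmf(y, mu)

def zip_loglik(ys, mus, ps):
    """Sum of per-observation contributions."""
    return sum(zip_loglik_i(y, mu, p) for y, mu, p in zip(ys, mus, ps))

# Hypothetical toy data with many zeros (not from the question).
ys = [0, 0, 0, 1, 0, 2, 0, 0]
mu, p = 0.8, 0.4
n = len(ys)

ll_zip = zip_loglik(ys, [mu] * n, [p] * n)
ll_zip_p0 = zip_loglik(ys, [mu] * n, [0.0] * n)   # p -> 0 limit
ll_pois = sum(poisson_logpmf(y, mu) for y in ys)   # plain Poisson

print("ZIP log likelihood:        ", ll_zip)
print("ZIP with p = 0:            ", ll_zip_p0)
print("Plain Poisson log likelihood:", ll_pois)
```

With `p = 0` the two branches collapse to the plain Poisson log likelihood, which is the limiting case discussed above. In practice the parameters would be estimated by maximizing this function rather than fixed by hand.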

More pragmatically, one concern is whether you have enough data that are not zero. Estimates will be unstable given insufficient data. For example, 250 observations may be great, but if 240 are zeros, then even if you can perfectly predict 0 versus >0, you still only have 10 observations about the actual count distribution. In addition, you could check the distribution of the residuals and the residuals versus the fitted values. In particular, if you are concerned about the number of zeros being an issue, check the residuals and fit for the zero values.
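To make the trade-off concrete for the sales data in the question, here is a rough sketch of how the bin width changes both the share of zeros and the number of non-zero bins. It assumes, purely for illustration, that the 100 sales arrive as a homogeneous Poisson process over the 3 days. Real sales almost certainly vary by time of day, which would produce more empty bins than this. The function name and the bin widths tried are hypothetical.

```python
import math

def zero_share(n_sales, n_days, bin_minutes):
    """Expected share of empty bins under a homogeneous Poisson assumption."""
    n_bins = n_days * 24 * 60 // bin_minutes
    rate = n_sales / n_bins        # mean number of sales per bin
    p_zero = math.exp(-rate)       # Poisson probability that a bin is empty
    return n_bins, rate, p_zero

for minutes in (10, 30, 60):
    n_bins, rate, p_zero = zero_share(100, 3, minutes)
    expected_nonzero = n_bins * (1.0 - p_zero)
    print(f"{minutes:>3}-minute bins: {n_bins} bins, "
          f"mean {rate:.3f} per bin, "
          f"expected zero share {p_zero:.3f}, "
          f"expected non-zero bins {expected_nonzero:.1f}")
```

Wider bins lower the zero share but also leave fewer observations overall, so the choice of aggregation level matters at least as much as any threshold on the percentage of zeros.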

If your model is not fitting the zeros or the count data well, you may want to consider some other form of model. One common alternative to the zero-inflated Poisson is the zero-inflated negative binomial. The primary difference is an overdispersion parameter, $\alpha$ (although the log likelihood function is rather more complex):

$$ \mathcal{L} = \sum_{i=1}^{n} \left\{ \begin{array}{rl} \ln\left(p_{i} + (1 - p_i)\left(\frac{1}{1 + \alpha\mu_{i}}\right)^{\frac{1}{\alpha}}\right) &\mbox{if $y_{i} = 0$} \\ \ln(1 - p_{i}) + \ln\Gamma\left(\frac{1}{\alpha} + y_i\right) - \ln\Gamma(y_i + 1) - \ln\Gamma\left(\frac{1}{\alpha}\right) + \left(\frac{1}{\alpha}\right)\ln\left(\frac{1}{1 + \alpha\mu_{i}}\right) + y_i\ln\left(1 - \frac{1}{1 + \alpha\mu_{i}}\right) &\mbox{if $y_{i} > 0$} \end{array} \right. $$
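A short sketch of the zero-inflated negative binomial log likelihood in Python may help when checking a formula like this by hand. It assumes the standard form in which $p_i$ is the excess-zero probability, the zero branch is $\ln(p_i + (1 - p_i)q_i^{1/\alpha})$ with $q_i = 1/(1 + \alpha\mu_i)$, and the positive branch starts with $\ln(1 - p_i)$. Function names and parameter values are illustrative assumptions only.

```python
import math

def nb_logpmf(y, mu, alpha):
    """Log pmf of the negative binomial with mean mu and overdispersion alpha."""
    m = 1.0 / alpha
    q = 1.0 / (1.0 + alpha * mu)
    return (math.lgamma(m + y) - math.lgamma(y + 1) - math.lgamma(m)
            + m * math.log(q) + y * math.log(1.0 - q))

def zinb_loglik_i(y, mu, p, alpha):
    """Log likelihood contribution of one observation under a ZINB model.

    p     : probability of an excess (structural) zero
    mu    : mean of the negative binomial count component
    alpha : overdispersion parameter (alpha -> 0 approaches the Poisson)
    """
    if y == 0:
        q = 1.0 / (1.0 + alpha * mu)
        return math.log(p + (1.0 - p) * q ** (1.0 / alpha))
    return math.log(1.0 - p) + nb_logpmf(y, mu, alpha)

# Hypothetical parameter values, chosen only to exercise the function.
mu, p, alpha = 0.8, 0.4, 1.5

# Sanity checks: probabilities sum to 1 and the mean is (1 - p) * mu.
probs = [math.exp(zinb_loglik_i(y, mu, p, alpha)) for y in range(400)]
total = sum(probs)
mean = sum(y * pr for y, pr in enumerate(probs))

print("total probability:", total)
print("model mean:", mean, "vs (1 - p) * mu =", (1.0 - p) * mu)
```

Checking that the implied probabilities sum to one is a cheap way to catch a misplaced logarithm or sign in either branch.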

You could also explore mixture models that assume observed data comes from an underlying mixture of distributions.

Here are some pages that may be helpful either for fitting models, talking about them, or graphing. For full transparency, I was the primary author of those pages. I am sure there are other good resources, but I linked those because I know them off the top of my head.

Zero-inflated Poisson

Zero-inflated negative binomial

Zero-truncated Poisson: this is more for different graphing approaches than as a model suggestion, although you could try a zero-truncated model on just the non-zero observations (i.e., exclude all zeros and see how that compares with the ZIP).

Related Solutions

Solved – Zero-inflated count models in R: what is the real advantage

I think this is a poorly chosen data set for exploring the advantages of zero inflated models, because, as you note, there isn't that much zero inflation.

plot(fitted(fm_pois), fitted(fm_zinb))

shows that the predicted values are almost identical.

In data sets with more zero-inflation, the ZI models give different (and usually better fitting) results than Poisson.

Another way to compare the fit of the models is to compare the size of residuals:

boxplot(abs(resid(fm_pois)) - abs(resid(fm_zinb)))

shows that, even here, the residuals from the Poisson are smaller than those from the ZINB. If you have some idea of a magnitude of the residual that is really problematic, you can see what proportion of the residuals in each model are above that. E.g. if being off by more than 1 was unacceptable

sum(abs(resid(fm_pois)) > 1)
sum(abs(resid(fm_zinb)) > 1)

shows the latter is a bit better - 20 fewer large residuals.

Then the question is whether the added complexity of the models is worth it to you.

Solved – Dealing with zero-inflation if the data are not count data type

They're still called "zero-inflated" models when modifying continuous distributions; there's zero-inflated gamma, zero-inflated lognormal, and so on.

For continuous proportions such as you describe, a zero-inflated beta model might be used (though it's not the only possibility, it's probably the most common by some distance).

If 100% coverage in that predefined part is possible, you might instead use 0-1 inflated beta models (sometimes called 0-and-1 inflated beta models; the search above finds some links for these as well), or if some other density between 0 and 1 is more suitable, some other form of 0-1-inflated continuous model.
