Solved – Zero inflated negative binomial in Stata

count-data, negative-binomial-distribution, zero-inflation

I'm trying to fit a negative binomial model on overdispersed count data with a large number of zeros. I have two questions:

1) How do I know that I have an 'excess' number of zeros, and that I should use zero inflated negative binomial?
Note: my outcome variable is the number of disease cases by week (520 weeks in total). The outcome is stratified by sex (male, female), age (categorical, 4 levels) and district (n = 3), so I have 24 count series per week (I did this so that I could include individual-level predictors in the count model).

2) How do I know which variables I should include in the 'inflation model'? Should I just include all predictors?

Best Answer

The idea of zero-inflated models is not that there are a lot of zeros in the dependent variable. Rather, it is that there are two separate processes in the data, each of which can lead to an observation of zero. In one process, the observations do not participate in the count process, so they could never have an observed outcome $Y_i \ne 0$ (call this the zero-inflation process). In the other, the observations do participate in the count process but happen to have a count of zero. This can clearly lead to an excess of zeros, since there are two distinct ways of observing a zero.
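One common way to write this two-process idea is as a mixture. The symbols here are introduced only for illustration and are not taken from the question: $\pi_i$ is the probability that observation $i$ belongs to the always-zero group, and $f$ is the negative binomial probability mass function of the count process with mean $\mu_i$ and dispersion $\alpha$.

$$
P(Y_i = 0) = \pi_i + (1 - \pi_i)\, f(0 \mid \mu_i, \alpha), \qquad
P(Y_i = y) = (1 - \pi_i)\, f(y \mid \mu_i, \alpha), \quad y = 1, 2, \dots
$$

The first term of $P(Y_i = 0)$ is the zero-inflation process; the second term is a zero produced by the count process itself.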

For example, suppose I am interested in the number of times students in a high school who qualify for free lunch actually eat the school lunch. There could be two reasons that a student would have an observation of zero. First, they could have never turned in the form for free lunch and thus, although they qualify, are never observed eating a free lunch. These students may eat school lunch a lot but pay for it, so they are never observed to eat a free school lunch. Basically, they are unable to participate in the count process. Second, a student may qualify, complete the form, and be able to get a free lunch every day, but still have a zero because they bring lunch from home every day. These students can participate in the count process, so the reason they have an observation of zero is totally different from that of the first group. The first group's observations are zero and cannot be non-zero. In the second group, some observations are zero but could have been non-zero. Suppose, further, that we know students are less likely to complete their free lunch form as they get older. Then grade level is a good predictor of "zero-inflation" in this case.
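The free-lunch example can be made concrete with a small simulation. This is only a sketch: the form-return probabilities, the mean number of lunches, and the dispersion value are made-up numbers, and the function names are hypothetical. It draws structural zeros (no form returned) with a probability that rises with grade, and draws everyone else from a negative binomial count process via a gamma-Poisson mixture.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_students(n_per_grade=5000, seed=1):
    """Simulate free-lunch counts from two zero-generating processes.

    Returns a list of (grade, is_structural_zero, count) tuples.
    All numeric settings below are invented for illustration.
    """
    rng = random.Random(seed)
    # Hypothetical probability of NOT returning the form, rising with grade.
    p_no_form = {9: 0.10, 10: 0.20, 11: 0.30, 12: 0.40}
    mean_lunches, alpha = 3.0, 0.5  # assumed count-process mean and dispersion
    rows = []
    for grade, p in p_no_form.items():
        for _ in range(n_per_grade):
            if rng.random() < p:
                # Structural zero: no form, so a free lunch can never be observed.
                rows.append((grade, True, 0))
            else:
                # Count process: a gamma-distributed rate fed into a Poisson
                # draw gives a negative binomial count with mean mean_lunches.
                lam = rng.gammavariate(1.0 / alpha, alpha * mean_lunches)
                rows.append((grade, False, poisson_draw(rng, lam)))
    return rows


def zero_share_by_grade(rows):
    """Return {grade: (share of all zeros, share of structural zeros)}."""
    out = {}
    for grade in sorted({g for g, _, _ in rows}):
        group = [(s, y) for g, s, y in rows if g == grade]
        n = len(group)
        all_zero = sum(1 for _, y in group if y == 0) / n
        structural = sum(1 for s, _ in group if s) / n
        out[grade] = (all_zero, structural)
    return out


if __name__ == "__main__":
    shares = zero_share_by_grade(simulate_students())
    for grade, (all_zero, structural) in shares.items():
        print(f"grade {grade}: zeros={all_zero:.3f}, structural={structural:.3f}")
```

In real data the structural-zero flag is not observed, which is exactly why the inflation part of the model needs a predictor such as grade to separate the two kinds of zero.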

For your data, you need to figure out if there are two processes leading to 0 disease cases by week, one in which only a zero is possible, and one in which zero is possible as part of a count process. I'm not sure what this might be in your case, but you know your data and can explore it to see if this is the case. If the zeros in your data are all a result of a count process (i.e., a case is zero, but could have been non-zero), then a zero inflation model is not appropriate. A regular negative binomial model is fine.
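As a rough exploratory aid, one can compare the observed share of zeros with the share that a plain negative binomial with the same mean and variance would produce. The sketch below is an assumption-laden shortcut rather than a formal test: it ignores covariates, uses a method-of-moments estimate of the dispersion $\alpha$ (variance $= \mu + \alpha\mu^2$), and the `weekly_cases` values are hypothetical. A gap between the two shares does not by itself show that a separate always-zero process exists; that still has to come from knowledge of how the data arise.

```python
import math


def nb_zero_probability(mean, alpha):
    """P(Y = 0) for a negative binomial with variance mean + alpha * mean**2."""
    if alpha <= 0:
        # alpha -> 0 is the Poisson limit.
        return math.exp(-mean)
    return (1.0 + alpha * mean) ** (-1.0 / alpha)


def zero_check(counts):
    """Compare the observed zero share with a moment-matched NB zero share."""
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        raise ValueError("all counts are zero; nothing to compare")
    var = sum((y - mean) ** 2 for y in counts) / (n - 1)
    # Method-of-moments dispersion, truncated at 0 (no overdispersion).
    alpha = max((var - mean) / mean ** 2, 0.0)
    observed = sum(1 for y in counts if y == 0) / n
    return {
        "mean": mean,
        "alpha": alpha,
        "observed_zero": observed,
        "nb_zero": nb_zero_probability(mean, alpha),
    }


if __name__ == "__main__":
    weekly_cases = [0, 0, 3, 1, 0, 0, 7, 2, 0, 0, 0, 5]  # hypothetical data
    for name, value in zero_check(weekly_cases).items():
        print(f"{name}: {value:.3f}")
```

With stratified data like that in the question, the same check would more sensibly be run within each stratum, since pooling strata with different means can itself produce extra zeros.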

To your second question: From this discussion, it follows that you want to include variables that could predict the first zero process, the zero inflation process that leads to some cases only having 0 as a possible outcome. In the case of my example, I would include grade or age as a predictor of zero-inflation.
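For the lunch example, the two parts of the model might be written as follows. The coefficients $\gamma$ and $\beta$ and the covariate vector $\mathbf{x}_i$ are illustrative notation, and the logit link for the inflation part is an assumption (it is the default in Stata's `zinb`, where the inflation predictors go in the `inflate()` option):

$$
\operatorname{logit}(\pi_i) = \gamma_0 + \gamma_1\,\text{grade}_i, \qquad
\log(\mu_i) = \beta_0 + \mathbf{x}_i^\top \boldsymbol{\beta}
$$

Here $\pi_i$ is the probability of being in the always-zero group and $\mu_i$ is the mean of the count process. The two equations need not share the same predictors: only variables thought to drive the always-zero process belong in the first one.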