Solved – When to use Poisson vs. geometric vs. negative binomial GLMs for count data

count-datageneralized linear modelnegative-binomial-distributionpoisson-regressionzero inflation

I'm trying to layout for myself when it's appropriate to use which regression type (geometric, Poisson, negative binomial) with count data, within the GLM framework (only 3 of the 8 GLM distributions are used for count data, although most of what I've read centers around the negative binomial and Poisson distributions).

When to use Poisson vs. geometric vs. negative binomial GLMs for count data?


So far I have the following logic: Is it count data? If Yes, Are the mean and variance unequal? If Yes, negative binomial regression. If no, Poisson regression. Is there zero inflation? If yes, zero inflated Poisson or zero inflated negative binomial.

Question 1 There doesn't seem to be a clear indication of which to use when. Is there something to inform that decision? From what I understand, once you switch to ZIP, the mean variance being equal assumption get's relaxed so it's pretty similar to NB again.

Question 2 Where does the geometric family fit into this or what kind of questions should I be asking of the data when deciding whether to use a geometric family in my regression?

Question 3 I see people interchanging the negative binomial and Poisson distributions all the time but not geometric, so I'm guessing there's something distinctly different about when to use it. If so, what is it?

P.S.
I've made a (probably oversimplified, from the comments) diagram (editable) of my current understanding if people wanted to comment/tweak it for discussion.
Count Data: GLM Decision Tree

Best Answer

Both the Poisson distribution and the geometric distribution are special cases of the negative binomial (NB) distribution. One common notation is that the variance of the NB is $\mu + 1/\theta \cdot \mu^2$ where $\mu$ is the expectation and $\theta$ is responsible for the amount of (over-)dispersion. Sometimes $\alpha = 1/\theta$ is also used. The Poisson model has $\theta = \infty$, i.e., equidispersion, and the geometric has $\theta = 1$.

So in case of doubt between these three models, I would recommend to estimate the NB: The worst case is that you lose a little bit of efficiency by estimating one parameter too many. But, of course, there are also formal tests for assessing whether a certain value for $\theta$ (e.g., 1 or $\infty$) is sufficient. Or you can use information criteria etc.

Of course, there are also loads of other single- or multi-parameter count data distributions (including the compound Poisson you mentioned) which sometimes may or may not lead to significantly better fits.

As for excess zeros: The two standard strategies are to either use a zero-inflated count data distribution or a hurdle model consisting of a binary model for zero or greater plus a zero-truncated count data model. As you mention excess zeros and overdispersion may be confounded but often considerable overdispersion remains even after adjusting the model for excess zeros. Again, in case of doubt, I would recommend to use an NB-based zero inflation or hurdle model by the same logic as above.

Disclaimer: This is a very brief and simple overview. When applying the models in practice, I would recommend to consult a textbook on the topic. Personally, I like the count data books by Winkelmann and that by Cameron & Trivedi. But there are other good ones as well. For an R-based discussion you might also like our paper in JSS (http://www.jstatsoft.org/v27/i08/).