Solved – Strategy for deciding appropriate model for count data

count-data, generalized linear model, negative-binomial-distribution, overdispersion, poisson distribution

What is the appropriate strategy for deciding which model to use with count data?
I have count data that I need to model as a multilevel model, and it was recommended to me (on this site) that the best way to do this is through BUGS or MCMCglmm. However, I am still learning about Bayesian statistics, and I thought I should first try to fit my data with generalized linear models, ignoring the nested structure of the data (just so I can get a vague idea of what to expect).

About 70% of the data are 0 and the ratio of variance to the mean is 33. So the data is quite over-dispersed.
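For reference, the two summary figures above (share of zeros and variance-to-mean ratio) can be computed as in the minimal sketch below. The `counts` list is made-up toy data, not the data from this question, and `count_summary` is a hypothetical helper name. The sketch also reports the share of zeros that a Poisson distribution with the same mean would imply, which gives a rough first check for excess zeros.

```python
import math
from statistics import mean, variance


def count_summary(counts):
    """Summarise a list of non-negative integer counts."""
    n = len(counts)
    m = mean(counts)
    return {
        # observed share of zeros
        "zero_fraction": sum(1 for c in counts if c == 0) / n,
        # sample variance divided by the mean (about 1 for Poisson data)
        "var_to_mean": variance(counts) / m,
        # share of zeros a Poisson with the same mean would produce
        "poisson_zero_fraction": math.exp(-m),
    }


# Hypothetical toy data: 70% zeros plus a few positive counts
counts = [0] * 7 + [1, 2, 30]
summary = count_summary(counts)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A variance-to-mean ratio far above 1 points to over-dispersion, and an observed zero share far above the Poisson-implied one points to excess zeros, but neither number alone says which model is the right remedy.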

After trying a number of different options (including Poisson, negative binomial, quasi-Poisson and zero-inflated models), I see very little consistency in the results (ranging from everything being significant to nothing being significant).

How can I make an informed decision about which type of model to choose, given the zero inflation and over-dispersion?
For instance, how can I infer that quasi-Poisson is more appropriate than negative binomial (or vice versa), and how can I know whether either has dealt adequately with the excess zeros?
Similarly, how do I check that no over-dispersion remains if a zero-inflated model is used? And how should I decide between a zero-inflated Poisson and a zero-inflated negative binomial?

Best Answer

You can always compare count models by looking at their predictions (preferably on a hold-out set). J. Scott Long discusses this graphically (plotting the predicted values against the actual ones). His textbook here describes it in detail, but you can also look at 6.4 in this document.
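One way to make the predicted-versus-actual comparison concrete is to tabulate, for each count value, the observed proportion against the average predicted probability. The sketch below does this for a Poisson model only. The hold-out counts `y_holdout` and the predicted means `mu_hat` are made-up placeholders (in practice `mu_hat` would come from your fitted model), and the function names are hypothetical.

```python
import math
from collections import Counter


def poisson_pmf(k, mu):
    """Poisson probability of observing count k given mean mu > 0."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def observed_vs_predicted(y, mu_hat, max_count):
    """Rows of (count, observed share, mean predicted probability, difference)."""
    n = len(y)
    observed_counts = Counter(y)
    rows = []
    for k in range(max_count + 1):
        observed = observed_counts.get(k, 0) / n
        # average the per-observation predicted probability of seeing k
        predicted = sum(poisson_pmf(k, mu) for mu in mu_hat) / n
        rows.append((k, observed, predicted, observed - predicted))
    return rows


# Hypothetical hold-out counts and predicted means (assumed, not fitted here)
y_holdout = [0, 0, 0, 0, 0, 0, 0, 1, 2, 30]
mu_hat = [3.3] * len(y_holdout)

rows = observed_vs_predicted(y_holdout, mu_hat, max_count=5)
for k, obs, pred, diff in rows:
    print(f"{k:>2}  observed={obs:.3f}  predicted={pred:.3f}  diff={diff:+.3f}")
```

A large positive difference at zero means the model under-predicts zeros. Repeating the table with the probabilities from a negative binomial or zero-inflated fit shows which model tracks the observed distribution most closely.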

You can compare models using AIC or BIC. There is also a test called the Vuong test, which I am not terribly familiar with, but which can compare a zero-inflated model to a non-nested alternative. Here is a SAS paper describing it briefly on page 10 to get you started. It is also implemented in R; see this posting.
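To show what these comparisons involve, here is a minimal sketch that computes AIC, BIC and the basic (uncorrected) Vuong statistic from per-observation log-likelihoods. Everything specific in it is an assumption for illustration: the toy data `y`, the zero-inflated Poisson parameters `pi_zip` and `lam_zip` (plugged in by hand rather than estimated), and all function names. With real data you would use the log-likelihoods from your fitted models.

```python
import math
from statistics import NormalDist, mean, stdev


def log_poisson(k, lam):
    """Log-probability of count k under a Poisson with mean lam."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)


def log_zip(k, pi, lam):
    """Log-probability of k under a zero-inflated Poisson (extra-zero share pi)."""
    if k == 0:
        return math.log(pi + (1 - pi) * math.exp(-lam))
    return math.log(1 - pi) + log_poisson(k, lam)


def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * loglik


def vuong_statistic(loglik_1, loglik_2):
    """Uncorrected Vuong z statistic; positive values favour model 1."""
    m = [a - b for a, b in zip(loglik_1, loglik_2)]
    z = math.sqrt(len(m)) * mean(m) / stdev(m)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Hypothetical toy data
y = [0, 0, 0, 0, 0, 0, 0, 3, 4, 5]
lam_pois = mean(y)             # Poisson MLE is the sample mean
pi_zip, lam_zip = 0.68, 4.0    # assumed values, NOT estimated from the data

ll_pois = [log_poisson(k, lam_pois) for k in y]
ll_zip = [log_zip(k, pi_zip, lam_zip) for k in y]

aic_pois, aic_zip = aic(sum(ll_pois), 1), aic(sum(ll_zip), 2)
bic_pois, bic_zip = bic(sum(ll_pois), 1, len(y)), bic(sum(ll_zip), 2, len(y))
z, p = vuong_statistic(ll_zip, ll_pois)

print(f"AIC  Poisson={aic_pois:.2f}  ZIP={aic_zip:.2f}")
print(f"BIC  Poisson={bic_pois:.2f}  ZIP={bic_zip:.2f}")
print(f"Vuong z={z:.2f}, two-sided p={p:.4g}")
```

Lower AIC or BIC is better. The Vuong statistic is only asymptotically standard normal, so a ten-observation example like this one shows the mechanics and nothing more. Published implementations often apply a correction for the number of parameters, which this sketch omits.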