Solved – Poisson regression assumptions and how to test them in R

count-data, poisson-regression, r, zero-inflation

I would like to test which regression fits my data best. My dependent variable is a count and has a lot of zeros.

I need some help deciding which model and family to use (Poisson, quasi-Poisson, or zero-inflated Poisson regression), and how to test the assumptions.

  1. Poisson regression: as far as I understand, the strong assumption is that the mean of the dependent variable equals its variance. How do you test this? How close together do they have to be? Are the unconditional or the conditional mean and variance used for this? What do I do if this assumption does not hold?
  2. I read that if the variance is greater than the mean we have overdispersion, and that potential ways to deal with this are including more independent variables or using family=quasipoisson. Does this distribution have any other requirements or assumptions? What test do I use to see whether (1) or (2) fits better: is it simply anova(m1,m2)?
  3. I also read that the negative binomial distribution can be used when overdispersion appears. How do I do this in R? How does it differ from quasipoisson?
  4. Zero-inflated Poisson regression: I read that the Vuong test checks which model fits better.

    > vuong (model.poisson, model.zero.poisson)

    Is that correct? What assumptions does a zero-inflated regression have?

  5. UCLA's Academic Technology Services, Statistical Consulting Group has a section about zero-inflated Poisson regression, and tests the zero-inflated model (a) against the standard Poisson model (b):

    > m.a <- zeroinfl(count ~ child + camper | persons, data = zinb)
    > m.b <- glm(count ~ child + camper, family = poisson, data = zinb)
    > vuong(m.a, m.b)

I don't understand what the | persons part of the first model does, and why you can compare these models. I had expected the regression to be the same and just use a different family.

Best Answer

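1) Calculate the sample mean $\bar{X}$ and the sample variance $S^2$. If the process is truly Poisson, both estimate the same quantity, so the ratio $S^2/\bar{X}$ should be close to 1. The usual test is the index-of-dispersion test: under the Poisson hypothesis, $(n-1)S^2/\bar{X}$ is approximately $\chi^2_{n-1}$ distributed, where $n$ is the size of the sample.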

Note that this test ignores the covariates, so it is probably not the best way to check for overdispersion in a regression setting.

Note also that this test is probably weak against the zero-inflated hypothesis.
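A rough sketch of how both kinds of check could look in R. The first part is a covariate-free check based on the index of dispersion, where $(n-1)S^2/\bar{X}$ is compared with a $\chi^2_{n-1}$ distribution. The second part uses the Pearson statistic from a fitted Poisson GLM, which does take the covariates into account. The simulated data and the names x, y and fit are made up for illustration.

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))  # simulated counts, illustrative only

# Unconditional check: index-of-dispersion test.
# Under a Poisson model with one common mean, (n - 1) * var / mean is
# approximately chi-squared with n - 1 degrees of freedom.
D <- (n - 1) * var(y) / mean(y)
p_uncond <- pchisq(D, df = n - 1, lower.tail = FALSE)

# Conditional check: Pearson dispersion statistic from a Poisson GLM.
# Values well above 1 point to overdispersion given the covariates.
fit <- glm(y ~ x, family = poisson)
pearson_chisq <- sum(residuals(fit, type = "pearson")^2)
phi <- pearson_chisq / df.residual(fit)
p_cond <- pchisq(pearson_chisq, df = df.residual(fit), lower.tail = FALSE)

c(dispersion = phi, p_uncond = p_uncond, p_cond = p_cond)
```

Because the mean of y depends on x here, the unconditional check can flag overdispersion even though the counts are Poisson given x, which is exactly why the conditional version is usually the more relevant one for a regression.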

3) Negative binomial in R: use glm.nb from the MASS package, or use the zeroinfl function from the pscl package with dist = "negbin".
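A minimal sketch of the glm.nb route, using simulated overdispersed counts. The data, the coefficient values and the object names fit_pois and fit_nb are assumptions for illustration, not taken from the question.

```r
library(MASS)  # provides glm.nb()

set.seed(2)
n <- 500
x <- rnorm(n)
# Simulated overdispersed counts: negative binomial with size (theta) = 1.5
y <- rnbinom(n, size = 1.5, mu = exp(0.5 + 0.3 * x))

fit_pois <- glm(y ~ x, family = poisson)  # Poisson fit for comparison
fit_nb   <- glm.nb(y ~ x)                 # negative binomial fit, log link by default

fit_nb$theta           # estimated dispersion parameter theta
AIC(fit_pois, fit_nb)  # lower AIC indicates the better-fitting model
```

With data that really are overdispersed, the negative binomial fit should come out with the clearly lower AIC.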

4) ZIP (zero-inflated Poisson) is a mixture model. A binary outcome determines whether a subject belongs to group A (where a 0 is certain) or to group B (where counts are Poisson or negative binomial distributed). An observed 0 comes either from a subject in group A or from a subject in group B who just happened to have a count of 0. Both parts of the model can depend on covariates: group membership is modeled like a logistic regression (the log odds are linear in the covariates), and the Poisson part is modeled in the usual way (the log mean is linear in the covariates). So you need the usual assumptions for a logistic regression (for the certain-0 part) and the usual assumptions for a Poisson regression. In other words, a ZIP model will not cure overdispersion in the count part; it only handles a big clump of zeros.
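To make the mixture idea concrete, here is a small base-R simulation without covariates. The values pi0 = 0.3 and lambda = 2 are arbitrary assumptions chosen for illustration.

```r
set.seed(3)
n      <- 10000
pi0    <- 0.3  # assumed probability of belonging to group A (certain zero)
lambda <- 2    # assumed Poisson mean in group B

in_A <- rbinom(n, size = 1, prob = pi0) == 1
y    <- ifelse(in_A, 0L, rpois(n, lambda))

# Zeros come from group A plus the Poisson zeros of group B
expected_zero <- pi0 + (1 - pi0) * exp(-lambda)
observed_zero <- mean(y == 0)

# Share of zeros a plain Poisson with the same overall mean would predict
plain_poisson_zero <- exp(-mean(y))

c(expected = expected_zero, observed = observed_zero,
  plain_poisson = plain_poisson_zero)
```

The observed share of zeros should sit close to the mixture formula and well above what a plain Poisson with the same mean predicts, which is the "big clump of zeros" the zero-inflated model is meant to absorb.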

5) Not sure what the data set is, and I couldn't find the reference. zeroinfl needs a model for both the Poisson part and the binary (certain 0 or not) part. The certain-0 part goes second, after the |. So m.a is saying that whether the person is a certain 0 or not depends on "persons", and, assuming the subject is not a certain 0, count is a function of camper and child. In other words, log(mean) is a linear function of camper and child for those subjects who are not a certain 0.

m.b is just a generalized linear model of count in terms of camper and child, both treated as fixed effects, with no separate zero component. The family is Poisson and the link function is the log.
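A hedged sketch of the two models on simulated data, since the original data set is not available here. The data frame dat and all coefficient values are invented; only the variable names (count, child, camper, persons) follow the question. It assumes the pscl package is installed.

```r
# Requires the pscl package: install.packages("pscl")
library(pscl)

set.seed(4)
n <- 1000
# Hypothetical data frame that reuses the variable names from the question
dat <- data.frame(
  child   = rpois(n, 0.7),
  camper  = rbinom(n, 1, 0.5),
  persons = sample(1:4, n, replace = TRUE)
)
# P(certain zero) depends on persons; the Poisson mean depends on child and camper
p_zero <- plogis(1.5 - 0.8 * dat$persons)
mu     <- exp(0.8 - 0.6 * dat$child + 0.5 * dat$camper)
dat$count <- ifelse(rbinom(n, 1, p_zero) == 1, 0L, rpois(n, mu))

# Left of "|": count part; right of "|": zero-inflation part
m.a <- zeroinfl(count ~ child + camper | persons, data = dat)
m.b <- glm(count ~ child + camper, family = poisson, data = dat)

coef(m.a)        # coefficients are prefixed count_ and zero_
vuong(m.a, m.b)  # Vuong test of the zero-inflated fit against the plain Poisson fit
```

The coefficient names show the two sub-models side by side: the count_ terms belong to the Poisson part and the zero_ terms to the logistic part for the certain zeros.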