Poisson or quasi-Poisson in a regression with count data and overdispersion

count-data, overdispersion, poisson-regression, quasi-likelihood

I have count data (a demand/offer analysis counting the number of customers, which may depend on many factors). I tried a linear regression with normal errors, but the QQ-plot looks poor. I then tried a log transformation of the response: once again, a poor QQ-plot.

So now I'm trying a regression with Poisson errors. With a model containing all the significant variables, I get:

Null deviance: 12593.2  on 53  degrees of freedom
Residual deviance:  1161.3  on 37  degrees of freedom
AIC: 1573.7

Number of Fisher Scoring iterations: 5

The residual deviance is far larger than the residual degrees of freedom (1161.3 on 37, a ratio of about 31): I have overdispersion.

How can I know whether I need to use quasipoisson? What is the goal of quasipoisson in this case? I read this advice in "The R Book" by Crawley, but I see neither the point nor a large improvement in my case.

Best Answer

When deciding which GLM to estimate, you should think about plausible relationships between the expected value of your target variable given the right-hand-side (rhs) variables and the variance of the target variable given the rhs variables. Plots of the residuals vs. the fitted values from your Normal model can help with this. With Poisson regression, the assumed relationship is that the variance equals the expected value; rather restrictive, I think you'll agree. With a "standard" linear regression, the assumption is that the variance is constant regardless of the expected value. For a quasi-Poisson regression, the variance is assumed to be a linear function of the mean (proportional to it, with the dispersion factor estimated from the data); for negative binomial regression, a quadratic function.
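To make these four assumptions concrete, here is a minimal sketch in plain Python (not the R session from the question). The function names and the parameters `sigma2`, `phi` and `theta` are illustrative assumptions, not fitted values. The only number taken from the question is the ratio of residual deviance to residual degrees of freedom, used as a rough stand-in for the dispersion; R's `quasipoisson` family estimates the dispersion from the Pearson statistic instead, so its value will generally differ somewhat.

```python
# Assumed variance functions Var(Y | x), written as functions of the mean mu.
# sigma2, phi and theta are illustrative parameters, not fitted values.

def var_normal(mu, sigma2):
    """Standard linear regression: constant variance, whatever the mean."""
    return sigma2

def var_poisson(mu):
    """Poisson regression: variance equals the mean."""
    return mu

def var_quasipoisson(mu, phi):
    """Quasi-Poisson: variance proportional to the mean (dispersion phi)."""
    return phi * mu

def var_negbin(mu, theta):
    """Negative binomial (NB2 form): variance quadratic in the mean."""
    return mu + mu ** 2 / theta

# Rough dispersion from the question's output: residual deviance / residual df.
phi_rough = 1161.3 / 37

for mu in (1.0, 10.0, 100.0):
    print(
        mu,
        var_poisson(mu),
        round(var_quasipoisson(mu, phi_rough), 1),
        var_negbin(mu, theta=2.0),  # theta=2.0 is an arbitrary example value
    )
```

With a dispersion of roughly 31, a quasi-Poisson fit keeps the same coefficient estimates as the Poisson fit but multiplies their standard errors by about the square root of the dispersion (about 5.6 here), which is why some variables that look significant under the Poisson model may stop being so.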

However, you aren't restricted to these relationships. The specification of a "family" (other than "quasi") determines the mean-variance relationship. I don't have The R Book, but I imagine it has a table that shows the family functions and corresponding mean-variance relationships. For the "quasi" family you can specify any of several mean-variance relationships, and you can even write your own; see the R documentation. It may be that you can find a much better fit by specifying a non-default value for the mean-variance function in a "quasi" model.
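One rough way to choose among candidate mean-variance functions is to group the observations (for example by bins of fitted values), compute the mean and variance within each group, and fit a straight line to log(variance) against log(mean). A slope near 1 points to a variance proportional to the mean, a slope near 2 to a variance proportional to the squared mean; in R's `quasi` family these correspond to the documented options `variance = "mu"` and `variance = "mu^2"`. The sketch below is plain Python, the helper `loglog_slope` is a hypothetical name, and the group summaries are made-up numbers constructed so that the variance is exactly three times the mean.

```python
import math

def loglog_slope(means, variances):
    """Least-squares fit of log(variance) = log(phi) + p * log(mean).

    Returns (p, phi): the estimated power and multiplier in Var = phi * mean**p.
    """
    xs = [math.log(m) for m in means]
    ys = [math.log(v) for v in variances]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    p = sxy / sxx
    phi = math.exp(y_bar - p * x_bar)
    return p, phi

# Hypothetical group summaries (not from the question's data),
# constructed so that variance = 3 * mean.
group_means = [2.0, 5.0, 10.0, 20.0, 50.0]
group_vars = [3.0 * m for m in group_means]

p, phi = loglog_slope(group_means, group_vars)
print(round(p, 3), round(phi, 3))
```

On real data the points will scatter around the line, so the slope is only a guide to which variance function is worth trying, not a formal test.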

You should also pay attention to the range of the target variable; in your case it's nonnegative count data. If a substantial fraction of the values are low (0, 1, 2), the continuous distributions probably won't fit well; if not, there's not much value in using a discrete distribution. It's rare that you'd consider Poisson and Normal distributions as competitors.
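The point about low counts can be illustrated with a small calculation, sketched here in plain Python with hypothetical helper names. For a few example means it compares the Poisson probability of observing exactly zero with the probability that a Normal distribution with the same mean and variance puts on impossible negative values.

```python
import math

def normal_cdf(x, mean, sd):
    """Normal CDF via erfc, which stays accurate far out in the lower tail."""
    return 0.5 * math.erfc(-(x - mean) / (sd * math.sqrt(2.0)))

def poisson_pmf(k, mu):
    """Poisson probability of observing exactly k events when the mean is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

for mu in (1, 5, 50):  # arbitrary example means
    p_zero = poisson_pmf(0, mu)
    # Mass a Normal(mu, variance = mu) approximation assigns to negative values.
    mass_below_zero = normal_cdf(0.0, mu, math.sqrt(mu))
    print(mu, p_zero, mass_below_zero)
```

At a mean of 1, the matching Normal puts roughly 16% of its mass below zero, while at a mean of 50 that mass is negligible, which is the sense in which a continuous distribution becomes a reasonable option once the counts are large.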