Model approach for count data with a large range of y values

generalized linear model, overdispersion, plm, poisson distribution, poisson-regression

I am modeling ridership data for specific routes by month over a number of years. Some routes have as few as about 1,000 riders per month, while others may have over 20,000 riders per month. I have been looking at different approaches to modeling this data, including a panel data model and a generalized linear model (Poisson family). However, I have found some information saying that you should only use the Poisson family when the y variable has a small range.

Is there a better approach to modeling count data with a large range of y values than Poisson?

Best Answer

For count data it is advisable (for reasons of interpretability of the estimated parameters) to use a generalized linear model (GLM) with a logarithmic link function; see my answer to Goodness of fit and which model to choose linear regression or Poisson.
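To make the interpretability point concrete, here is the standard log-link identity for a single covariate (the symbols $x$, $\beta_0$ and $\beta_1$ are illustrative and do not come from the question):

$$
\log \operatorname{E}[Y \mid x] = \beta_0 + \beta_1 x
\quad\Longrightarrow\quad
\frac{\operatorname{E}[Y \mid x + 1]}{\operatorname{E}[Y \mid x]} = e^{\beta_1}.
$$

So a one-unit increase in $x$ multiplies the expected count by $e^{\beta_1}$, and that reading is the same for a route with 1,000 riders as for one with 20,000.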

But the distribution family can be chosen in different ways. The reason for the advice you refer to, namely to use Poisson regression only when the counts are not too large, is that large counts usually show overdispersion, that is, the variance is larger than the Poisson model allows (under the Poisson, the variance equals the mean).
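As a rough illustration of what overdispersion looks like, the sketch below simulates two sets of counts with the same mean: one purely Poisson, one Poisson with a gamma-distributed rate (which gives a negative binomial). The mean of 50, the shape parameter `theta = 5`, and the helper names are all assumptions made for the demonstration, not values from the ridership data. It uses only the Python standard library.

```python
import random
import statistics


def poisson_draw(rate, rng):
    """Draw one Poisson(rate) count by counting unit-rate
    exponential arrivals that occur before time `rate`."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < rate:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def dispersion_ratio(counts):
    """Sample variance divided by sample mean (about 1 for Poisson data)."""
    return statistics.variance(counts) / statistics.mean(counts)


rng = random.Random(1)          # fixed seed so the run is repeatable
mean_count = 50.0               # assumed mean, chosen for speed
theta = 5.0                     # assumed gamma shape (smaller = more overdispersion)
n = 4000

# Plain Poisson counts: variance should be close to the mean.
poisson_counts = [poisson_draw(mean_count, rng) for _ in range(n)]

# Gamma-mixed Poisson counts (negative binomial):
# variance is roughly mean + mean**2 / theta.
mixed_counts = [
    poisson_draw(rng.gammavariate(theta, mean_count / theta), rng)
    for _ in range(n)
]

print("Poisson variance/mean:", round(dispersion_ratio(poisson_counts), 2))
print("Mixed variance/mean:  ", round(dispersion_ratio(mixed_counts), 2))
```

With these assumed settings the first ratio should come out near 1 and the second near $1 + \mu/\theta = 11$. A ratio well above 1 in real data (ideally computed on residuals from a fitted model rather than on raw counts) is the usual sign that a plain Poisson family is too restrictive.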

That can be handled in various ways, such as using (in R terminology) a quasipoisson family, which can be good enough. Or you can use a negative binomial family. Or, especially if you only want predictions from the model and do not need interpretable parameters, use the traditional approach of an ordinary linear model (that is, identity link function) for the response $\sqrt{Y}$. The square-root transformation is (approximately) variance-stabilizing for the Poisson (and quasi-Poisson) families. See Why is the square root transformation recommended for count data? for an explanation.
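The variance-stabilizing claim can be checked numerically. By the delta method, $\operatorname{Var}(\sqrt{Y}) \approx \operatorname{Var}(Y) / (4\mu)$, which is about $1/4$ for a Poisson regardless of the mean $\mu$ (and about $\phi/4$ for a quasi-Poisson with dispersion $\phi$). The sketch below simulates Poisson counts at three assumed means (25, 100, 400, chosen only to keep the run fast) and compares the variance before and after the square root. The helper `poisson_draw` is a hypothetical standard-library sampler, not part of any modeling package.

```python
import math
import random
import statistics


def poisson_draw(rate, rng):
    """Draw one Poisson(rate) count by counting unit-rate
    exponential arrivals that occur before time `rate`."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < rate:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


rng = random.Random(7)   # fixed seed so the run is repeatable
n = 3000
results = {}

for mean_count in (25, 100, 400):   # assumed means for illustration
    counts = [poisson_draw(mean_count, rng) for _ in range(n)]
    raw_var = statistics.variance(counts)                       # grows with the mean
    sqrt_var = statistics.variance([math.sqrt(y) for y in counts])  # stays near 0.25
    results[mean_count] = (raw_var, sqrt_var)
    print(mean_count, round(raw_var, 1), round(sqrt_var, 3))
```

The raw variance should track the mean while the variance of $\sqrt{Y}$ stays close to 0.25 in every row, which is what makes an ordinary least-squares fit on $\sqrt{Y}$ reasonable even when routes differ in size by an order of magnitude.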

For more about overdispersion, see Modelling a Poisson distribution with overdispersion and Comparing overdispersion distributions.