Solved – Model approach for count data with a large range of y values

Tags: generalized-linear-model, overdispersion, plm, poisson-distribution, poisson-regression

I am modeling ridership data for specific routes by month over a number of years. Some routes have as few as about 1,000 riders per month, while other routes may have over 20,000 riders per month. I have been looking at different approaches to modeling this data, including a panel data model and a generalized linear model (Poisson family). However, I have found some information that says you should only use the Poisson family when the y variable has a small range.

Is there a better approach to modeling count data with a large range of y values than Poisson?

Best Answer

For count data it is advisable (for interpretability of the estimated parameters) to use a generalized linear model (GLM) with a logarithmic link function; see my answer to "Goodness of fit and which model to choose linear regression or Poisson".

But the distribution family can be chosen in different ways. The reason for the advice you refer to, namely to use Poisson regression only when the counts are not too large, is that large counts usually come with overdispersion: the variance is larger than the Poisson family allows, since a Poisson variance equals its mean.
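A quick way to see whether overdispersion is present is to fit the Poisson GLM and look at the Pearson dispersion statistic. The sketch below is only an illustration: the data are simulated (20 hypothetical routes, 36 months, route means between 1,000 and 20,000), and the names `riders`, `route` and `route_mean` are made up for the example, not taken from the question.

```r
set.seed(42)

# Hypothetical ridership data: 20 routes x 36 months
n_routes <- 20
n_months <- 36
route <- factor(rep(seq_len(n_routes), each = n_months))
route_mean <- rep(seq(1000, 20000, length.out = n_routes), each = n_months)

# Negative binomial draws: variance = mu + mu^2 / size, i.e. larger than Poisson
riders <- rnbinom(n_routes * n_months, mu = route_mean, size = 50)

# Poisson GLM with log link and one mean level per route
fit_pois <- glm(riders ~ route, family = poisson(link = "log"))

# Pearson dispersion statistic: close to 1 if the Poisson variance is adequate,
# clearly above 1 if the data are overdispersed
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
dispersion
```

A value far above 1 suggests that Poisson standard errors would be too small, which is what motivates the alternatives discussed next.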

That can be solved in various ways, such as using (in R terminology) a quasipoisson family, which can be good enough. Or you can use a negative binomial family. Or, especially if you only want predictions from the model and do not need interpretable parameters, use the traditional approach of an ordinary linear model (that is, identity link function) for the response $\sqrt{Y}$. The square-root transformation is (approximately) variance stabilizing for the Poisson (and quasi-Poisson) families. See "Why is the square root transformation recommended for count data?" for an explanation of this!
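The three options above could look roughly like this in R. This is a sketch on simulated data, not the asker's data: the data frame `d` and its columns `route`, `month` and `riders` are hypothetical, and `glm.nb()` comes from the MASS package. The last lines check the variance-stabilizing claim numerically: for a Poisson variable with a large mean, the variance of $\sqrt{Y}$ should be close to 1/4.

```r
library(MASS)  # provides glm.nb()

set.seed(42)

# Hypothetical overdispersed ridership data: 20 routes x 36 months
d <- data.frame(
  route = factor(rep(1:20, each = 36)),
  month = rep(1:36, times = 20)
)
mu <- rep(seq(1000, 20000, length.out = 20), each = 36)
d$riders <- rnbinom(nrow(d), mu = mu, size = 50)

# 1. Quasi-Poisson: Poisson point estimates, standard errors scaled
#    by an estimated dispersion parameter
fit_qp <- glm(riders ~ route + month, family = quasipoisson(link = "log"), data = d)

# 2. Negative binomial: variance = mu + mu^2 / theta, with theta estimated
fit_nb <- glm.nb(riders ~ route + month, data = d)

# 3. Ordinary linear model (identity link) for the square root of the counts
fit_sqrt <- lm(sqrt(riders) ~ route + month, data = d)

summary(fit_qp)$dispersion  # estimated dispersion of the quasi-Poisson fit
fit_nb$theta                # estimated negative binomial shape parameter

# Numerical check of variance stabilization for a pure Poisson variable
v_sqrt <- var(sqrt(rpois(1e5, lambda = 1000)))
v_sqrt
```

Note that squaring the fitted values of the square-root model gives predictions on the original scale that are slightly biased downwards, which is one reason this route is best kept for prediction rather than interpretation.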

For more about overdispersion, see "Modelling a Poisson distribution with overdispersion" and "Comparing overdispersion distributions".

Related Solutions

Count Data Modeling – Strategy for Deciding the Appropriate Model

You can always compare count models by looking at their predictions (preferably on a hold-out set). J. Scott Long discusses this graphically (plotting the predicted values against the actual ones). His textbook here describes it in detail, but you can also look at section 6.4 of this document.

You can compare models using AIC or BIC. There is also the Vuong test, which I am not terribly familiar with, but which can compare non-nested models such as a zero-inflated model and its standard counterpart. Here is a SAS paper describing it briefly on page 10 to get you started. It is also implemented in R; see this posting.
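As a minimal sketch of both comparisons (information criteria and hold-out predictions), assuming simulated overdispersed counts with a single hypothetical predictor `x`; `glm.nb()` is from the MASS package, and the 75/25 split is an arbitrary choice for the example:

```r
library(MASS)  # provides glm.nb()

set.seed(7)

# Hypothetical overdispersed counts with one predictor
n <- 600
x <- runif(n)
y <- rnbinom(n, mu = exp(1 + 1.5 * x), size = 2)
d <- data.frame(x = x, y = y)

# Hold out 25% of the rows for out-of-sample checks
test_idx <- sample(n, size = n / 4)
train <- d[-test_idx, ]
test <- d[test_idx, ]

fit_pois <- glm(y ~ x, family = poisson(link = "log"), data = train)
fit_nb <- glm.nb(y ~ x, data = train)

# In-sample information criteria (lower is better)
ic <- data.frame(
  model = c("poisson", "negbin"),
  AIC = c(AIC(fit_pois), AIC(fit_nb)),
  BIC = c(BIC(fit_pois), BIC(fit_nb))
)
ic

# Out-of-sample predicted means and root mean squared error
pred_pois <- predict(fit_pois, newdata = test, type = "response")
pred_nb <- predict(fit_nb, newdata = test, type = "response")
rmse <- c(
  poisson = sqrt(mean((test$y - pred_pois)^2)),
  negbin = sqrt(mean((test$y - pred_nb)^2))
)
rmse
```

For the graphical comparison, something like `plot(pred_nb, test$y); abline(0, 1)` plots predicted against actual values. Because both models use a log link for the mean, their point predictions can be very close even when AIC clearly favors one of them; the difference shows up in how well the spread of the counts is described.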

Longitudinal Count Data – Choosing the Right Model for Longitudinal Count Data

What you are looking for might be a Generalized Linear Mixed Model, i.e. a Poisson model with a random intercept to start with.

To motivate the choice of a Generalized Linear Mixed Model, I will provide some background from Mixed Models: Theory and Applications by E. Demidenko (below, read "cluster" as a particular firm in your data):

Often data have a clustered (panel or tabular) structure. Classical statistics assumes that observations are independent and identically distributed (iid). Applied to clustered data, this assumption may lead to false results. In contrast, the mixed effects model treats clustered data adequately and assumes two sources of variation, within cluster and between clusters. Two types of coefficients are distinguished in the mixed model: population-averaged and cluster (or subject) - specific. The former have the same meaning as in classical statistics, but the latter are random and are estimated as posteriori means.

and:

The Generalized Linear Mixed Model (GLMM) is an extension of the Generalized Linear Model (GLM) complicated by random effects.

I believe that you are interested in allowing for random effects for years by firm.

This can be done with the glmer function from the lme4 package in R:

library(lme4)  # provides glmer()

set.seed(1)
# Longitudinal data in a "long" format: 120 firms observed in each of 4 years
data.sim <- data.frame(firmID = rep(1:120, times = 4),
                       year = rep(1:4, each = 120),
                       X1 = rnorm(120*4),
                       X2 = runif(120*4),
                       res = rpois(120*4, lambda = 10))
head(data.sim)
#   firmID year         X1        X2 res
# 1      1    1 -0.6264538 0.3604340  11
# 2      2    1  0.1836433 0.4421617   7
# 3      3    1 -0.8356286 0.1257292   7
# 4      4    1  1.5952808 0.6243645   6
# 5      5    1  0.3295078 0.3024313   7
# 6      6    1 -0.8204684 0.2396372  12

# Fit a model 
model.glmer <- glmer(res ~ 1 + year + X1 + X2 + (year | firmID),
                     data = data.sim, 
                     family = poisson(link = "log"))
summary(model.glmer)
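
Written out, the formula passed to glmer above corresponds to the following model, where $i$ indexes firms and $t$ indexes years. The symbols $\beta$, $b_{0i}$, $b_{1i}$ and $\Sigma$ are my notation for this sketch, not something taken from the output:

$$
\begin{aligned}
\text{res}_{it} \mid b_{0i}, b_{1i} &\sim \text{Poisson}(\mu_{it}), \\
\log \mu_{it} &= \beta_0 + \beta_1\,\text{year}_{it} + \beta_2\,X1_{it} + \beta_3\,X2_{it} + b_{0i} + b_{1i}\,\text{year}_{it}, \\
(b_{0i}, b_{1i})^\top &\sim N(0, \Sigma).
\end{aligned}
$$

The fixed effects $\beta$ are shared by all firms, while the term (year | firmID) adds a firm-specific random intercept $b_{0i}$ and a firm-specific random slope for year $b_{1i}$, which are allowed to be correlated through $\Sigma$. Since the simulated response here is pure Poisson noise, the estimated random-effect variances may come out close to zero; with real data they describe how much firms differ.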
