Solved – Linear Regression for a discrete count dependent variable

count-data regression

I want to model the number of trips taken by households to investigate the effects of income, number of cars available, etc. on the number of trips.

One potential issue is that the probability distribution for Y_i at X_i won't be normal. The number of trips is a whole number in a small range, which I believe violates an assumption of general linear models. How does this translate to models where the values the dependent variable can take on are also limited, but over a much larger range (for instance, regressions with income as the dependent variable)?

Best Answer

By "GLM", I assume you mean the General Linear Model, which generalizes multiple regression and ANOVA. For what it's worth, many people (myself included) just call that 'multiple regression'; I typically reserve "GLM" for the generalized linear model. You are right that the general linear model requires normality. For the sake of clarity, only the errors / residuals need to be normal; neither X nor Y itself needs to be. (To understand this better, it may help to read my answer here: What if residuals are normally distributed, but y is not?)

On the other hand, the generalized linear model has no inherent difficulty with count data, so long as the distribution used falls within the exponential family. The prototypical GLM for count data is Poisson regression, which may be a good option for your data. Note, however, that the Poisson is fairly restrictive: the variance of the conditional distribution of the counts must equal the conditional mean, and the proportion of zeros must match what that mean implies. Those constraints are often not met in practice. As a result, a number of other options exist: quasi-Poisson regression and negative binomial regression (both of which allow the variance to exceed the mean), and zero-inflated and hurdle models (which treat the zeros with a separate model component). If you aren't very familiar with all this, that may be a lot to navigate. Another option, if you are more familiar with it, is ordinal logistic regression. All you need there is to be able to say that, e.g., 1 trip is more than 0 trips. Many of the types of models mentioned in this paragraph are demonstrated at the excellent UCLA statistics help site.
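To make the Poisson option and its variance-equals-mean constraint concrete, here is a minimal sketch in plain Python. The data are made up for illustration (income in $10,000s and trips per household), and the names fit_poisson, income and trips are hypothetical. It fits a one-predictor Poisson regression with a log link by Newton-Raphson, then computes the Pearson dispersion statistic. In practice you would use a statistics package rather than hand-rolled code; this only shows what such a fit does.

```python
import math

# Hypothetical data: household income (in $10,000s) and number of trips.
income = [1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 6, 8]
trips = [0, 1, 1, 2, 2, 3, 5, 6, 0, 3, 2, 7]


def fit_poisson(x, y, max_iter=50, tol=1e-10):
    """Fit log(mu_i) = b0 + b1 * x_i by Newton-Raphson on the Poisson log-likelihood."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: derivative of the log-likelihood w.r.t. (b0, b1).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix (2 x 2), i.e. the negative Hessian.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system H * d = g.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


b0, b1 = fit_poisson(income, trips)
fitted = [math.exp(b0 + b1 * xi) for xi in income]

# Pearson dispersion statistic: roughly 1 if the variance equals the mean.
pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(trips, fitted))
dispersion = pearson / (len(trips) - 2)  # n minus the number of coefficients

print(f"intercept = {b0:.3f}, slope = {b1:.3f}, dispersion = {dispersion:.2f}")
```

A dispersion value well above 1 would suggest overdispersion, which is the situation where quasi-Poisson or negative binomial regression is usually preferred. Because of the log link, exp(slope) is the multiplicative change in the expected number of trips per unit increase in income.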


Regarding your question about how this scales up to situations where the response can take many more values but still cannot be negative (like income), the issue is complicated. The truth is that many variables can only take positive values, but are treated as normal(ish) and modeled with linear regression anyway. The prototypical example of regression modeling is adult height (going back to Galton), but heights cannot be negative. The real question isn't whether the errors are perfectly normal; they never will be. The real question is rather: Is it good enough? And there the answer might well be 'yes'.

A common problem with non-negative data is that the variance scales with the mean. In this case, people will often use a transformation, or use a robust, heteroscedasticity-consistent 'sandwich' estimator for the standard errors. (For an overview of strategies used in that kind of situation, see my answer here: Alternatives to one-way ANOVA for heteroskedastic data.)
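As an illustration of the sandwich idea, here is a minimal sketch in plain Python for a simple (one-predictor) linear regression. The data are made up so that the spread grows with x, and the variable names are hypothetical. It compares the classical standard error of the slope with the HC0 heteroscedasticity-consistent version, which is the simplest sandwich variant; statistical software typically also offers small-sample corrections (HC1 to HC3).

```python
import math

# Hypothetical non-negative data whose spread grows with x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.2, 1.9, 3.4, 3.5, 6.1, 4.8, 8.9, 6.0, 11.5, 7.4]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Ordinary least squares fit of y = intercept + slope * x.
slope = sxy / sxx
intercept = y_bar - slope * x_bar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Classical standard error: assumes one common error variance.
sigma2 = sum(e ** 2 for e in residuals) / (n - 2)
se_classical = math.sqrt(sigma2 / sxx)

# HC0 sandwich standard error: each squared residual stands in for
# the error variance of its own observation.
meat = sum((xi - x_bar) ** 2 * e ** 2 for xi, e in zip(x, residuals))
se_robust = math.sqrt(meat / sxx ** 2)

print(f"slope = {slope:.3f}")
print(f"classical SE = {se_classical:.3f}, HC0 robust SE = {se_robust:.3f}")
```

The point estimate of the slope is unchanged; only its standard error (and hence tests and confidence intervals) differs between the two approaches.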

There are also distributions suited to this kind of situation that are compatible with the generalized linear model (i.e., that are members of the exponential family), most notably the Gamma distribution. Gamma regression could well be used to model income, and I believe it occasionally is, but in truth, I think other approaches are used more commonly.
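For a sense of what Gamma regression involves, here is a minimal sketch in plain Python. The data are made up (years of education and income in $1,000s), the helper names ols and fit_gamma_log are hypothetical, and the log link is an assumption on my part: it is a common choice, though not the Gamma's canonical link. With a log link the Gamma variance is proportional to the squared mean, so the iteratively reweighted least squares weights are all equal and each step reduces to an ordinary least squares fit. The dispersion parameter is not estimated here.

```python
import math

# Hypothetical data: years of education and income (in $1,000s, strictly positive).
educ = [8, 10, 12, 12, 14, 16, 16, 18, 20, 21]
income = [18, 24, 27, 35, 38, 52, 44, 70, 85, 120]


def ols(x, z):
    """Least-squares intercept and slope for z = b0 + b1 * x."""
    n = len(x)
    x_bar, z_bar = sum(x) / n, sum(z) / n
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxz / sxx
    return z_bar - b1 * x_bar, b1


def fit_gamma_log(x, y, max_iter=100, tol=1e-10):
    """Fit log(mu_i) = b0 + b1 * x_i for a Gamma response by Fisher scoring (IRLS)."""
    b0, b1 = ols(x, [math.log(yi) for yi in y])  # start from OLS on log(y)
    for _ in range(max_iter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working response. The IRLS weights are constant for this
        # link/variance combination, so an unweighted fit suffices.
        z = [e + (yi - mi) / mi for e, yi, mi in zip(eta, y, mu)]
        new_b0, new_b1 = ols(x, z)
        change = abs(new_b0 - b0) + abs(new_b1 - b1)
        b0, b1 = new_b0, new_b1
        if change < tol:
            break
    return b0, b1


b0, b1 = fit_gamma_log(educ, income)
fitted = [math.exp(b0 + b1 * xi) for xi in educ]

print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print(f"multiplicative change in mean income per extra year: {math.exp(b1):.3f}")
```

Unlike a linear regression on log(income), this models the mean of income itself, so the fitted values are on the original scale and never negative.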
