Distributions – Using Poisson Distribution for Normally Distributed Count Data

Tags: count-data, distributions, generalized linear model, normal distribution

I have some count data on woodland species that appear to be fairly normally distributed (see histogram below). I am fitting the counts as the response variable in a GLM with multiple explanatory variables. As the distribution appears normal, is it better in this instance to use a Gaussian family for the GLM, despite the response being count data? I have looked at similar questions, but none of them really asks the same thing.

[Figure: histogram of the count data]

Best Answer

You appear to be looking at the marginal distribution of the counts. Only the conditional distribution, that is, the distribution of the counts given the explanatory variables, matters for choosing the model. (My answer to "What if residuals are normally distributed, but y is not?" addresses the converse situation.) To assess this properly, fit a model and look at the residuals.
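To make the marginal versus conditional distinction concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration, not taken from the question: a single covariate `x`, a log-linear mean `exp(1 + x)`, the seed, and the helper names `poisson_draw`, `simulate`, and `dispersion`. It compares the variance-to-mean ratio of the pooled counts with the same ratio inside a narrow slice of `x`, where the counts are close to a single Poisson distribution.

```python
import math
import random
import statistics


def poisson_draw(rng, mu):
    """Draw one Poisson(mu) variate with Knuth's multiplication method."""
    limit = math.exp(-mu)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate(n=20000, seed=1):
    """Simulate counts whose conditional distribution given x is Poisson.

    The covariate range and the coefficients are illustrative assumptions.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 2.0) for _ in range(n)]
    ys = [poisson_draw(rng, math.exp(1.0 + x)) for x in xs]
    return xs, ys


def dispersion(values):
    """Variance-to-mean ratio; close to 1 for Poisson data."""
    return statistics.variance(values) / statistics.mean(values)


xs, ys = simulate()

# Marginal view: all counts pooled, ignoring the covariate.
marginal_ratio = dispersion(ys)

# Conditional view: counts whose covariate lies in a narrow window.
slice_counts = [y for x, y in zip(xs, ys) if 0.95 <= x <= 1.05]
conditional_ratio = dispersion(slice_counts)

print(f"marginal variance/mean:    {marginal_ratio:.2f}")
print(f"conditional variance/mean: {conditional_ratio:.2f}")
```

In this setup the pooled counts look far more spread out than a Poisson distribution would allow, while the counts within the slice behave like Poisson data. The shape of the pooled histogram therefore says little about which family suits the model.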

There are several issues with fitting an incorrect type of model to data (e.g., an OLS regression for count data):

  1. The predicted values can go outside of the possible range (e.g., $\hat{y}<0$). It should be easy to check this. Using spline functions for continuous explanatory variables may allow the model to fit well enough within the range of your covariates (but extrapolation should be considered verboten).
  2. The residuals will have non-constant variance, because the variance of count data typically grows with the mean. This should also be easy to check, and you could always use a sandwich (heteroscedasticity-consistent) estimator of the standard errors for testing.
  3. The conditional distribution cannot be exactly normal, because counts are discrete. This is not really a big deal for testing parameters, as you seem to have a lot of data, but it would be very sketchy if you want to make prediction intervals.
  4. It may well not be the right way to think about your situation. This is a toughie, but only you can ultimately say.
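The first two checks in the list above can be sketched as follows. This is a rough illustration on simulated data, not a recipe for the woodland data: the single covariate, the log-linear mean `exp(1 + x)`, the seed, the extrapolation point `x = -0.5`, and the helper names `poisson_draw` and `fit_ols` are all assumptions. It fits a straight line by least squares, looks at the fitted values inside and outside the covariate range, compares the residual spread between low and high fitted values, and computes a basic sandwich (HC0) standard error for the slope next to the classical one.

```python
import math
import random


def poisson_draw(rng, mu):
    """Draw one Poisson(mu) variate with Knuth's multiplication method."""
    limit = math.exp(-mu)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def fit_ols(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Simulated counts (illustrative assumption): Poisson with log-linear mean.
rng = random.Random(7)
xs = [rng.uniform(0.0, 2.0) for _ in range(5000)]
ys = [poisson_draw(rng, math.exp(1.0 + x)) for x in xs]

intercept, slope = fit_ols(xs, ys)
fitted = [intercept + slope * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]

# Check 1: do predictions leave the possible range (counts are >= 0)?
min_fitted = min(fitted)                  # within the observed covariates
extrapolated = intercept + slope * -0.5   # just outside the observed range

# Check 2: is the residual spread constant across fitted values?
pairs = sorted(zip(fitted, resid))
half = len(pairs) // 2
low_ms = sum(r * r for _, r in pairs[:half]) / half
high_ms = sum(r * r for _, r in pairs[half:]) / (len(pairs) - half)
spread_ratio = high_ms / low_ms

# Classical versus sandwich (HC0) standard error of the slope.
n = len(xs)
x_bar = sum(xs) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
classical_se = math.sqrt(sum(r * r for r in resid) / (n - 2) / sxx)
hc0_se = math.sqrt(sum(((x - x_bar) * r) ** 2 for x, r in zip(xs, resid))) / sxx

print(f"smallest in-sample fitted value: {min_fitted:.2f}")
print(f"prediction at x = -0.5:          {extrapolated:.2f}")
print(f"residual mean square, high/low:  {spread_ratio:.2f}")
print(f"slope SE classical / HC0:        {classical_se:.3f} / {hc0_se:.3f}")
```

With real data the same checks apply to the fitted values and residuals of whatever model was actually fitted. With several explanatory variables a statistics package would do the fitting, but the quantities to inspect are the same.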

All in all, it isn't clear you should use OLS based on what you've presented here. It may be acceptable, or may be possible to make it acceptable, but you'll have to check and think carefully about the results, and your situation and goals.