Regression – Applying OLS Regression with Count Data: Methods and Assumptions

assumptions, count-data, least-squares, poisson-regression, regression

I have the following linear model in R:

model <- lm(response ~ v1*v2 + v3*v2 + v4, data=df)

v2 is number of hours spent sleeping

v4 is ordinal (7-point Likert), measuring subject rating of sleep quality

v1 and v3 are 3-level factors measuring time spent on different activities (0-10 mins, 11-20, 20+)

response is a count variable ranging from 1-6. It measures the number of correct items on a 6-item quiz

I'm wondering what criteria are used to decide whether Poisson regression should be used instead. I've considered the following:

  • I've read that in a Poisson model the mean and variance of the
    response should be equal, which is not true in this case (mean = 5,
    variance = 1.1, mode = 6).

  • The distribution is negatively skewed, which works in favour
    of using a Poisson model. What types of transformations are possible
    if I wanted to use OLS?

  • The range of the variable is 1-6. I believe one reason to use a
    Poisson model is the bounding at 0; however, I don't have any 0
    values and the majority of the values are in the upper range (6).

  • Does Poisson regression require a larger sample size than OLS to gain
    sufficient power? My N is ~120.

  • I've tried running ncvTest() to check for heteroscedasticity and
    the test results are in favour of using OLS (no assumption violation)
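The mean-variance comparison from the first bullet can be checked directly. The following is a minimal sketch in base R, assuming a hypothetical data frame `sim` whose `response` column is simulated to stand in for the real quiz scores (the sampling probabilities are made up for illustration and are not the data described above):

```r
# Hypothetical stand-in for the real data: 120 scores on a 6-item quiz.
# The probabilities below are an assumption chosen only to give a
# left-skewed distribution piled up near 6.
set.seed(1)
sim <- data.frame(
  response = sample(1:6, size = 120, replace = TRUE,
                    prob = c(0.01, 0.02, 0.05, 0.12, 0.30, 0.50))
)

m <- mean(sim$response)
v <- var(sim$response)

# Under a Poisson model the variance equals the mean,
# so this ratio would be close to 1.
dispersion_ratio <- v / m

# Observed proportion of each score next to the Poisson probabilities
# implied by the sample mean.
observed <- as.numeric(table(factor(sim$response, levels = 0:6))) / nrow(sim)
poisson  <- dpois(0:6, lambda = m)
comparison <- data.frame(score    = 0:6,
                         observed = round(observed, 3),
                         poisson  = round(poisson, 3))

print(c(mean = m, variance = v, ratio = dispersion_ratio))
print(comparison)
```

A ratio well below 1 indicates underdispersion relative to the Poisson, and the side-by-side table shows where the fitted Poisson puts probability that the observed scores do not.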

Many say that Poisson regression should be used for count data no matter what, but OLS doesn't seem unreasonable given some of the points above. What should my primary considerations be, and how should I weigh the points outlined above? Is there anything that could be used to argue against the use of a Poisson model in this case (maybe sample size)?

EDIT:

To address the duplicate post concern: I don't believe the other post is asking the same thing (or at least the answer provided there doesn't really help in this case):

  1. The other post deals with extensive variables, but in this case we have intensive ones

  2. Given intensive variables, the other post suggests a linear model is OK but doesn't explain why

  3. The response variable in the other post is unbounded at the upper end (i.e. number of patents). In this case, the response measures the number of correct items on an exam. Given that there is a maximum value (i.e. the value can't be greater than the number of items on the exam), the response here is bounded at both ends, with no respondents touching the lower bound of 0

So my question is really about how to correctly handle positive integer (discrete) response values that are bounded at both ends.

Best Answer

As I understand it, your empirical probability of a 0 count is 0, the mean is 5, and the theoretical probability of a count being greater than 6 is 0. A Poisson distribution can never have such properties.
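To make the mismatch concrete, here is a small sketch using base R's `dpois()` and `ppois()`. It assumes only that the Poisson mean is set to the reported sample mean of 5:

```r
lambda <- 5  # Poisson mean set to the reported sample mean

# P(X = 0) under the Poisson: exp(-5), about 0.0067
p_zero <- dpois(0, lambda)

# P(X > 6) under the Poisson: about 0.24, although a 6-item quiz
# cannot produce a score above 6
p_above <- ppois(6, lambda, lower.tail = FALSE)

# A Poisson variable has variance equal to its mean (5 here),
# whereas the reported sample variance is 1.1.
print(c(p_zero = p_zero, p_above_6 = p_above, poisson_variance = lambda))
```

So a Poisson with mean 5 would place roughly a quarter of its probability on scores that are impossible here.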

While ncvTest() has not rejected the assumption of homoscedasticity, from your description the assumptions of OLS are not met either: with a response confined to 1-6 and a mean of 5, your residuals are confined to a narrow, lopsided range (roughly -4 to 1), and this is not what a normal distribution looks like. Also, your data are discrete, so exact normality was a priori impossible anyway.

What to do? Some options:

  1. Use OLS anyway, with either a log-transformed dependent variable or no transformation of the dependent variable. As long as your hypotheses are strongly confirmed or rejected, you may be OK. If your results are borderline, it is more problematic.
  2. Use a binary logit model comparing <=4 vs >=5, as then you at least have no distributional assumptions to worry about.
  3. Try an ordered logit model. This is going to have more power than the binary logit, but its diagnostics need to be more carefully met as it is making stronger distributional assumptions.
  4. Do all of the above, and, if the conclusions don't change, feel good.
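
The options above could be sketched in R roughly as follows. This is an illustration under stated assumptions, not a definitive analysis: the data frame `sim` is simulated, the formula is reduced to `v2 + v4` (the factors `v1`, `v3` and the interactions from the original model are omitted for brevity), and `polr()` from the MASS package is one common way to fit an ordered logit.

```r
library(MASS)  # provides polr(); MASS ships with standard R installations

# Simulated stand-in data (hypothetical, not the asker's data)
set.seed(42)
n <- 120
sim <- data.frame(
  v2 = round(rnorm(n, mean = 7, sd = 1), 1),  # hours slept
  v4 = sample(1:7, n, replace = TRUE)         # 7-point sleep-quality rating
)
# Quiz score in 1..6, generated from a latent variable that rises with v2
latent <- (sim$v2 - 7) + rlogis(n)
sim$response <- cut(latent, breaks = c(-Inf, -3.5, -2.8, -2, -1, 0, Inf),
                    labels = FALSE)

# Option 1: OLS, untransformed and with a log-transformed response
fit_ols <- lm(response ~ v2 + v4, data = sim)
fit_log <- lm(log(response) ~ v2 + v4, data = sim)

# Option 2: binary logit comparing scores <= 4 with scores >= 5
sim$high <- as.integer(sim$response >= 5)
fit_logit <- glm(high ~ v2 + v4, data = sim, family = binomial)

# Option 3: ordered logit on the score treated as an ordered factor
sim$response_ord <- factor(sim$response, ordered = TRUE)
fit_ord <- polr(response_ord ~ v2 + v4, data = sim, Hess = TRUE)

# Option 4: check whether the direction of the v2 effect agrees across models
v2_signs <- c(ols     = unname(sign(coef(fit_ols)["v2"])),
              log_ols = unname(sign(coef(fit_log)["v2"])),
              logit   = unname(sign(coef(fit_logit)["v2"])),
              ordered = unname(sign(coef(fit_ord)["v2"])))
print(v2_signs)
```

The coefficients are on different scales across these models, so the comparison in option 4 is about sign and significance rather than magnitude.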