I have the following linear model in R:
model <- lm(response ~ v1*v2 + v3*v2 + v4, data=df)
- v2 is number of hours spent sleeping
- v4 is ordinal (7 pt Likert) measuring subject rating of sleep quality
- v1 and v3 are 3 level factors measuring time spent on different activities (0-10mins, 11-20, 20+)
- response is a count variable ranging from 1-6. This measures number of correct items on a 6 item quiz
I'm wondering what criteria are used to decide whether Poisson regression should be used instead. I've considered the following:
- I've read that in a Poisson model the mean and variance of the response should be equal, which is not true in this case (mean = 5, variance = 1.1, mode = 6).
- The distribution is negatively skewed, which works in favour of using a Poisson model. What types of transformations are possible if I wanted to use OLS?
- The range of the variable is 1-6. I believe one reason to use a Poisson model is the bounding at 0; however, I don't have any 0 values and the majority of the values are in the upper range (6).
- Does Poisson regression require a larger sample size than OLS to gain sufficient power? My N is ~120.
- I've tried running ncvTest() to check for heteroscedasticity, and the test results are in favour of using OLS (no assumption violation).
Many say that Poisson regression should be used for count data no matter what, but OLS doesn't seem unreasonable given some of the points above. What should my primary considerations be, and how should I weigh the points outlined above? Is there anything that could be used to argue against the use of a Poisson model in this case (maybe sample size?)?
EDIT:
To address the duplicate post concern: I don't believe the other post is asking the same thing (or at least the answer provided there doesn't really help in this case):
- The other post is dealing with extensive variables, but in this case we have intensive ones.
- Given intensive variables, the other post suggests a linear model is OK but doesn't explain why.
- The response variable in the other post is unbounded at the upper end (i.e. number of patents). In this case, the response measures number of correct items on an exam. Given there is a maximum value to that (i.e. the value can't be greater than the number of items on the exam), the response here is bounded at both ends, with no respondents touching the lower bound of 0.
So my question here is really asking about how to correctly handle positive integer (discrete) response values that are bounded at both ends
Best Answer
As I understand it, your empirical probability of a 0 count is 0, the mean is 5, and the theoretical probability of a count being greater than 6 is 0. A Poisson distribution can never have such properties.
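A quick way to see the mismatch, as a rough sketch only: assume a single Poisson distribution whose mean equals the sample mean of 5 reported in the question. This ignores the covariates, so it describes the marginal distribution rather than the fitted model, and the variable names are purely illustrative.

```r
# Poisson with mean 5, matching the sample mean reported in the question
# (assumption: one common mean, covariates ignored)
lambda <- 5

p_zero   <- dpois(0, lambda)                      # P(Y = 0)
p_over_6 <- ppois(6, lambda, lower.tail = FALSE)  # P(Y > 6)
pois_var <- lambda                                # a Poisson variance equals its mean

round(c(p_zero = p_zero, p_over_6 = p_over_6), 4)
#>   p_zero p_over_6
#>   0.0067   0.2378
```

Under that assumption, roughly 24% of the scores would be expected to exceed 6, which is impossible on a 6 item quiz, and the implied variance of 5 is far above the observed 1.1.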
While ncvTest() has not rejected the assumption of homoscedasticity, from your description the assumptions of OLS are not met either: your residuals will all be confined to a narrow, asymmetric range (roughly -5 to 1, or -1 to 5 depending on the sign convention), and this is not what a normal distribution looks like. Also, your data are discrete, so exact normality was impossible a priori anyway.
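The bounded, discrete residuals can be illustrated with simulated data. The sketch below is not the data from the question: the scores are hypothetical draws from a binomial distribution clipped to 1-6, and the model is intercept-only so that the bounds are easy to verify by hand.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical data: 120 quiz scores out of 6, clipped to the 1-6 range
# described in the question (an assumption, not the real data)
y <- pmax(1, rbinom(120, size = 6, prob = 5 / 6))

fit <- lm(y ~ 1)        # intercept-only OLS: the fitted value is mean(y)
res <- residuals(fit)

range(res)                      # lies within [1 - mean(y), 6 - mean(y)]
length(unique(round(res, 8)))   # at most 6 distinct residual values
table(round(res, 2))            # mass piles up near the upper bound
```

With covariates in the model the fitted values vary by observation, so the residuals take more distinct values, but each one is still restricted to a handful of points determined by the six possible scores.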
What to do? Some options: