Solved – use simple linear regression with count data

count-datalinear modelpoisson distribution

My analysis is about systematic changes over time.
I have counted data (values range from 2 to 7) that follows a normal distribution (according to the histogram and kurtosis), but this answer recommends analyze count data with GLM (log link function).
In my case, can I use simple linear regression? what would be a test to justify it? My sample size is big (~ 8,000)

thank you for your answers.

Best Answer

Your count data does not follow a normal distribution, because it simply can not. Because it can not, simple linear regression is not the way to go. That being said, a GLM with poisson distribution is not all that difficult. It can be done by a beginner. So why not just try it? Can you think of any good reason, why not?

Related Solutions

Solved – Does using count data as independent variable violate any of GLM assumptions

There are some nuances at play here, and they may be creating some confusion.

You state that you understand the assumptions of a logistic regression include "iid residuals... ". I would argue that this is not quite correct. We generally do say that about the General Linear Model (i.e., regression), but in that case it means that the residuals are independent of each other, with the same distribution (typically normal) having the same mean (0), and variance (i.e., constant variance: homogeneity of variance / homoscedasticity). Note however that for the Bernoulli distribution and the Binomial distribution, the variance is a function of the mean. Thus, the variance couldn't be constant, unless the covariate were perfectly unrelated to the response. That would be an assumption so restrictive as to render logistic regression worthless. I note that in the abstract of the pdf you cite, it lists the assumptions starting with "the statistical independence of the observations", which we might call i-but-not-id (without meaning to be too cute about it).

Next, as @kjetilbhalvorsen notes in the comment above, covariate values (i.e., your independent variables) are assumed to be fixed in the Generalized Linear Model. That is, no particular distributional assumptions are made. Thus, it does not matter if they are counts or not, nor if they range from 0 to 10, from 1 to 10000, or from -3.1415927 to -2.718281828.

One thing to consider, however, as @whuber notes, if you have a small number of data that are very extreme on one of the covariate dimensions, those points could have a great deal of influence over the results of your analysis. That is, you might get a certain result only because of those points. One way to think about this is to do a kind of sensitivity analysis by fitting your model both with and without those data included. You may believe it is safer or more appropriate to drop those observations, use some form of robust statistical analysis, or to transform those covariates so as to minimize the extreme leverage those points would have. I would not characterize these considerations as "assumptions", but they are certainly important considerations in developing an appropriate model.

Solved – Non-normally distributed data – Box-Cox transformation

The data are highly skewed & take just a few discrete values: the within-pair differences must consist of predominantly noughts & ones; no transformation will make them look much like normal variates. This is typical of count data where counts are fairly low.

If you assume that counts for each individual $j$ follow a different Poisson distribution, & that the change from low to high load condition has the same multiplicative effect on the rate parameter of each, you can extend the idea in significance of difference between two counts to a matched-pair design by conditioning on the total count for each pair, $n_j$:

$ \sum_{j=1}^m X_{1j} \sim \mathrm{Bin} (\sum_{j=1}^n n_j, \theta)$

where $m$ is the no. pairs. So the analysis reduces to inference about the Bernoulli parameter in a binomial experiment— 7 "successes" out of 24 trials if I read your graphs right.

Check the homogeneity of proportions across pairs—& note if they're too homogeneous it might indicate underdispersion (relative to a Poisson) of the original count variables.

Note that this approach is equivalent to the generalized linear model suggested for Poisson Repeated Measures ANOVA^†: while it tells you nothing about the nuisance parameters, point & interval estimates for the parameter of interest can be worked out on the back of a fag packet (so you don't need to worry about software requirements).

† Parametrize your model with the log odds $\zeta=\log_\mathrm{e} \frac{\theta}{1-\theta}$: then the maximum-likelihood estimator is $$\hat\zeta=\log_\mathrm{e}\frac{\sum x_{1j}}{\sum n_j - \sum x_{1j}}=\log_\mathrm{e}\frac{7}{24-7}\approx -0.887$$ with standard error $$\sqrt\frac{\sum n_j}{\sum x_{1j}(\sum n_j-\sum x_{1j})}=\sqrt\frac{24}{7\cdot(24-7)}\approx 0.449$$ for Wald tests & confidence intervals. If you want to adjust for over-/under-dispersion (i.e. use "quasi-Poisson" regression) , estimate the dispersion parameter as Pearson's chi-squared statistic (for association) divided by its degrees of freedom (22) & multiply the standard error by its square root.

Best Answer

Related Solutions

Solved – Does using count data as independent variable violate any of GLM assumptions

Solved – Non-normally distributed data – Box-Cox transformation

Related Question