Poisson Distribution in R – Simple Comparison of Two Poisson Means

count-datameanpoisson distributionrsmall-sample

I have two samples and I would like to determine whether the difference between them is statistically significant or not:
enter image description here

Because this is clearly data of small counts that cannot be approximated by a normal distribution, I don't think a t-test is appropriate. Instead, I believe I may assume that these counts follow (two distinct) Poisson distributions. As stated here, each of these Poisson distributions is best summarized by the respective sample means, as follows:

$$
\overline{y} = \frac{1}{N} \sum_{i = 1}^n{y_i}
$$

This obviously yields non-integer means (0.4 and 2.6, respectively), so I cannot use functions like poisson.test() or any of the functions from the exactci library directly.

I am aware of this answer regarding the C-test and E-test – but is there a straightforward implementation in R that would do this?

The only reasonable way to do this using poisson.test() I can think of would be to sum the rates for each condition and use the numbers of counts as the respective T parameters:

# Count data for each respective Condition
Cond1 <- c(0, 0, 0, 1, 1)
Cond2 <- c(1, 2, 3, 3, 4)

poisson.test(
  x = c(sum(Cond1), sum(Cond2)),
  T = c(length(Cond1), length(Cond2)),
  alternative = "two.sided"
)

which yields the following result:

    Comparison of Poisson rates

data:  c(sum(Cond1), sum(Cond2)) time base: c(length(Cond1), length(Cond2))
count1 = 2, expected count1 = 7.5, p-value = 0.007385
alternative hypothesis: true rate ratio is not equal to 1
95 percent confidence interval:
 0.01685531 0.67955077
sample estimates:
rate ratio 
 0.1538462

Is this valid or is there a better way to do this in R?

Best Answer

This is what I would do which I believe it is what @Dave is hinting at in comment.

Fit GLM with Poisson familiy distribution and see the effect of "condition" on the mean of each group.

Prepare the dataset:

Cond1 <- c(0, 0, 0, 1, 1)
Cond2 <- c(1, 2, 3, 3, 4)

dat <- data.frame(
    cond= c(rep('C1', length(Cond1)), rep('C2', length(Cond2))),
    count= c(Cond1, Cond2)
)

Fit the model and assess the significance of "condition":

fit <- glm(count ~ cond, data= dat, family= 'poisson')
summary(fit)

Call:
glm(formula = count ~ cond, family = "poisson", data = dat)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.13533  -0.89443  -0.07296   0.65703   0.80391  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.9163     0.7071  -1.296   0.1950  
condC2        1.8718     0.7595   2.464   0.0137 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 14.8823  on 9  degrees of freedom
Residual deviance:  5.8682  on 8  degrees of freedom
AIC: 27.731

Number of Fisher Scoring iterations: 5

So the mean of Cond 1 is estimated as exp(-0.9163) = 0.4, the mean of Cond 2 is exp(-0.9163 + 1.8718) = 2.6 and the difference has p = 0.0137 for the null hypothesis of being zero.