Poisson Distribution – How to Perform a C-Test for Comparing Poisson Means in Scipy

poisson distributionscipy

I'm relatively new to stats, and I have a large dataset of strings (n=114541) for which I'm interested in a binary event: whether X occurs or not (specifically, whether a specific language construct is present in a given string). X occurs rarely (443/n). X is poisson-distributed.

The strings in the dataset are arranged in a time series, and I'm trying to determine whether there's a significant change in the occurrence of X before and after a given moment in time. Because X follows a poisson distribution, I can easily calculate lambda before and after that moment in time (let's say that lambda1 = 127/35465, and lambda2 = 316/79076).

lambda1/lambda2 (or lambda1-lambda2) gives me an indication of whether the probability of X changed. What I need to know is if that change is significant (i.e., in the example above, if the difference between 0.00399 before the switch point and 0.0035 after is significant).

I've read that the C-test can test for significance between two lambda values in a poisson distribution, but I'm unsure if that's the right test. And if it is: is there a way of running it in Scipy (I've looked through the documentation, and couldn't find any mention of it)?

Best Answer

At that sort of sample size and probability, the C-test should be okay.

Since this is just a binomial test, you can test it using a binomial test in Scipy. x is 127, n is 127+316 and p is 35465/(35465+79076).

There are other tests for this situation. Vanilla R offers an exact Poisson test for example.

See also:

Krishnamoorthy K. and J. Thomson (2004),
"A more powerful test for comparing two Poisson means,"
Journal of Statistical Planning and Inference, 119, pp 23–35

which indicates that an unconditional test will tend to have greater power (as is generally the case for unconditional tests in this sort of situation).

It is, of course, not exact.


In response to comments:

You're now taking $X$ to be binomial, not Poisson.

I followed your assertion that $X$ was Poisson (by which the 35465 and 79076 are simply exposure) and showed how to do the corresponding C-test.

If you want to treat the 35465 & 79076 as numbers of trials you don't need the C-test at all. You just do a straight two-sample binomial test on the trials and successes you have.

Like so (this is in R):

> prop.test(x=c(127,316),n=c(35465,79076 ))

    2-sample test for equality of proportions with continuity correction

data:  c(127, 316) out of c(35465, 79076)
X-squared = 0.9902, df = 1, p-value = 0.3197
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.0011970592  0.0003667388
sample estimates:
     prop 1      prop 2 
0.003580995 0.003996156 

Incidentally, this p-value is very similar to what you get with the C-test.