Correct statistical test to compare 2 Poisson distributions

Tags: hypothesis testing, poisson distribution, rare-events, statistical significance

I know there are similar posts, and I've read a few of them already, but none of them give explanations that I actually understand. I'm completely new to statistics, and it doesn't help that so many different tests seem to be suggested for comparing 2 Poisson distributions. The worst part is I'm not even sure if my data follows a Poisson distribution and am not sure how to find out.

My problem is simple: I've got counts of mutations within a specific gene (of a specific length) that I'm looking at totalled over a population of patients. Assuming mutations are random, my null hypothesis is that the rate at which this specific gene gets hit is the same as for any other gene (so I'm comparing all mutations over all genes of all patients against the mutations landing only on the specific gene). I'm getting each rate simply by dividing the total number of mutations by the length (in bases) of the gene or genes concerned. The numbers are extremely small: the rates are something like 2×10^-5 (for the specific gene) and 7×10^-6 (for all mutations in all genes totalled). I just want to compare to see if the rates I'm getting are significantly different from each other. That is the whole goal.

Am I even looking in the right direction by googling Poisson distribution? The rates are extremely small, which is why I thought to google "statistics of rare events" and came across the Poisson in the first place.

Best Answer

> I'm not even sure if my data follows a poisson distribution and am not sure how to find out.

Let's start there. You can't say "my data are definitely Poisson". It's more a question of whether it's a reasonable model.

There are two main approaches.

  1. Investigate whether the conditions that give rise to a Poisson distribution are met, are likely to be met, or are close enough to being met that you are prepared to assume them.

  2. See whether this kind of data appears close enough to Poisson that inferences obtained by assuming it would be 'close enough' for your purposes. (How close it needs to be depends on your needs, preferences and so on.)

In case 1, the obvious thing to consider is whether you can treat it like a Poisson process. You need:

(1) constant intensity of events within a single variable

(2) independence

(3) "rare" events (so that, for example, the chance of more than one event occurring in a very small interval of time is correspondingly small) -- they don't actually have to be rare overall, just rare in small intervals of time.

The second approach would seek to identify how non-Poisson data like yours are, what specific forms of non-Poissonness you might have, and how much those might affect your inference. There are some suitable alternatives to consider (such as the negative binomial, which might be more suitable if the intensity varies from observation to observation).

> totalled over a population of patients

This would be one potential source of 'non-constant rate' (heterogeneity of people) that might lead you to consider whether the variable is 'overdispersed' (tends to show more variation relative to the mean than you'd expect from a Poisson).
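A rough way to look for overdispersion, assuming you have the counts broken down per patient (the question only mentions totals), is to compare the sample variance with the sample mean. For Poisson data the ratio should be near 1. The counts below are made up purely for illustration, and with very small expected counts the ratio is noisy, so treat it as a sketch rather than a formal test.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean; roughly 1 for Poisson data."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero; the index is undefined")
    return variance(counts) / m


# Hypothetical per-patient mutation counts for one gene (illustrative only).
per_patient_counts = [0, 0, 1, 0, 3, 0, 0, 2, 0, 0]

index = dispersion_index(per_patient_counts)
print(f"dispersion index = {index:.2f}")
```

A value well above 1 hints at more variation than a Poisson model would produce, which is when something like the negative binomial becomes worth a look.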

> I just want to compare to see if the rates I'm getting are significantly different from each other. That is the whole goal.

Then it may be better to start with that goal. A Poisson may be appropriate, but that goal can be approached even if it isn't.

> Am I even looking in the right direction by googling poisson distribution?

It's a pretty good place to start.

What's a typical expected count?


There are a number of approaches to comparing two Poisson counts.

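Perhaps the most common is to condition on the total count and test whether the two counts are in proportion to the sizes (lengths in bases) of the specific gene and of all other genes. Conditional on the total, the count in the specific gene is binomial under the null hypothesis, so the comparison becomes a test of a binomial proportion.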
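Here is a minimal sketch of that conditional test in plain Python. It assumes the "exposure" of each group is its total length in bases and that the two groups do not overlap (the gene of interest versus all other genes). The function names and the example numbers are hypothetical. The two-sided p-value sums the probabilities of every outcome no more likely than the observed one, which is one common convention among several.

```python
from math import exp, lgamma, log


def binom_logpmf(k, n, p):
    """Log of the Binomial(n, p) probability of exactly k successes (0 < p < 1)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)


def conditional_poisson_test(count_a, exposure_a, count_b, exposure_b):
    """Two-sided exact test that two Poisson counts share one rate per unit exposure.

    Conditional on n = count_a + count_b, count_a is Binomial(n, p0) under the
    null hypothesis, where p0 = exposure_a / (exposure_a + exposure_b).
    """
    n = count_a + count_b
    p0 = exposure_a / (exposure_a + exposure_b)
    observed = binom_logpmf(count_a, n, p0)
    tol = 1e-7  # guards against floating-point ties on the log scale
    # Sum the probabilities of all outcomes no more likely than the observed one.
    p_value = sum(
        exp(lp)
        for lp in (binom_logpmf(k, n, p0) for k in range(n + 1))
        if lp <= observed + tol
    )
    return min(1.0, p_value)


# Hypothetical data: 6 mutations in a 1,000-base gene versus
# 14 mutations in 9,000 bases of other genes.
p = conditional_poisson_test(6, 1_000, 14, 9_000)
print(f"p-value = {p:.4f}")
```

If both groups are totalled over the same patients, the number of patients cancels out of p0, so only the relative lengths matter.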

There are a number of other ways to approach it even with a simple Poisson comparison.

Some other issues:

If there are any other variables to account for, you might consider a GLM.
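As a sketch of what such a GLM could look like, a Poisson regression with a log link and an offset for exposure is one common form. The symbols here are assumptions for illustration, not from the question: Y_i is the mutation count in region i, t_i its exposure (for example, length in bases), g_i is 1 for the gene of interest and 0 otherwise, and x_i stands for any additional covariate.

```latex
\log \mathbb{E}[Y_i] = \log(t_i) + \beta_0 + \beta_1 g_i + \beta_2 x_i
```

Under this model, exp(β_1) is the ratio of the mutation rate in the gene of interest to the rate elsewhere, adjusted for x_i, and testing β_1 = 0 corresponds to testing equal rates.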

You might also consider whether patients should be treated as random effects.

If you're not prepared to assume it's Poisson, you might consider a variety of other possibilities; perhaps a permutation test.
