Solved – Poisson distribution and statistical significance

Tags: distributions, poisson distribution, r, statistical significance

Let's say I have a website which gets 100 hits per day on average (mu = 100). Yesterday my website got 130 hits (x = 130). If I assume a Poisson distribution, then the probability of getting exactly 130 hits is:

> dpois(130, 100)
[1] 0.0005752527 # about 0.06%

So this tells me that getting 130 hits is quite unusual for my website due to the low probability.

My understanding of statistical significance is that it is used to determine whether the outcome of an experiment is due either to chance or some kind of deterministic relationship.

How would I apply that in this situation?
What test should one use? (and is it in R?)

Many thanks in advance for your time.

Note: I saw someone at a business talk ask something very similar to this and I had no idea what they meant by it, so now I'm just trying to educate myself. I'm new to R, but that seems like the software most used for these kinds of questions, hence my request.

Best Answer

There are two points to make:

1. It is not the specific value of 130 that is unusual, but the fact that it is much larger than 100. Getting more than 130 hits would have been even more surprising. So we usually look at P(X >= 130), not just P(X = 130). By your logic even 100 hits would be unusual, because dpois(100, 100) = 0.04. A more appropriate calculation is ppois(129, 100, lower.tail=FALSE) = 0.00228. This is still small, but not as extreme as your value. It also does not take into account that an unusually low number of hits might surprise you as well. We often multiply this upper-tail probability by 2 to account for that.
2. If you keep checking your hits every day, sooner or later even rare events will occur. For example, P(X >= 130) happens to be close to 1/365, so such an event would be expected to occur about once a year.
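The two points above can be sketched in R as follows. This is only a minimal illustration, assuming the long-run mean really is 100 hits per day. The variable names are made up for the example, and using base R's poisson.test as the "test" is one reasonable choice, not the only one.

```r
mu <- 100   # assumed long-run mean hits per day (from the question)
x  <- 130   # yesterday's observed count

# Upper-tail p-value: P(X >= 130) = P(X > 129) under Poisson(100)
p_upper <- ppois(x - 1, mu, lower.tail = FALSE)

# Rough two-sided p-value using the doubling convention, capped at 1
p_two_sided <- min(1, 2 * p_upper)

# Exact one-sided Poisson test from base R (stats package)
res <- poisson.test(x, T = 1, r = mu, alternative = "greater")

# Expected number of days per year with 130 or more hits
# if the true rate is still 100 per day
days_per_year <- 365 * p_upper

print(c(p_upper = p_upper, p_two_sided = p_two_sided,
        poisson_test = res$p.value, days_per_year = days_per_year))
```

The one-sided p-value reported by poisson.test should agree with the ppois calculation, since both compute the same upper-tail probability.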

Related Solutions

Distributions – Relationship Between Poisson and Exponential Distribution

I will use the following notation to be as consistent as possible with the wiki (in case you want to go back and forth between my answer and the wiki definitions for the Poisson and exponential distributions).

$N_t$: the number of arrivals up to time $t$, i.e. in the time interval $[0, t]$

$X_t$: the time it takes for one additional arrival to arrive assuming that someone arrived at time $t$

By definition, the following conditions are equivalent:

$ (X_t > x) \equiv (N_t = N_{t+x})$

The event on the left says that no one has arrived in the time interval $[t,t+x]$, which means that the count of arrivals at time $t+x$ is identical to the count at time $t$; that is the event on the right.

By the complement rule, we also have:

$P(X_t \le x) = 1 - P(X_t > x)$

Using the equivalence of the two events that we described above, we can re-write the above as:

$P(X_t \le x) = 1 - P(N_{t+x} - N_t = 0)$

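But, because a Poisson process has stationary increments (the distribution of the count in an interval depends only on the interval's length),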

$P(N_{t+x} - N_t = 0) = P(N_x = 0)$

Using the Poisson pmf, where $\lambda$ is the average number of arrivals per time unit and $x$ is a number of time units, the above simplifies to:

$P(N_{t+x} - N_t = 0) = \frac{(\lambda x)^0}{0!}e^{-\lambda x}$

i.e.

$P(N_{t+x} - N_t = 0) = e^{-\lambda x}$

Substituting into our original equation, we have:

$P(X_t \le x) = 1 - e^{-\lambda x}$

The above is the cdf of an exponential distribution with rate $\lambda$.
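As a quick numerical check of this result, the sketch below compares one minus the Poisson probability of zero arrivals with R's built-in exponential cdf. The rate and the waiting times are arbitrary values picked for illustration, not taken from the answer.

```r
lambda <- 2.5              # hypothetical arrival rate per time unit
x <- c(0.1, 0.5, 1, 2)     # hypothetical waiting times to check

# P(no arrivals in an interval of length x), from the Poisson pmf at 0
p_no_arrival <- dpois(0, lambda * x)

# cdf of the waiting time as derived above, and R's exponential cdf
cdf_derived <- 1 - p_no_arrival
cdf_builtin <- pexp(x, rate = lambda)

# Compare the two up to floating-point tolerance
print(all.equal(cdf_derived, cdf_builtin))
```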

Solved – Calculating the probability of gene list overlap between an RNA seq and a ChIP-chip data set

You are close, with your use of dhyper and phyper, but I don't understand where 0:2 and -1:2 are coming from.

The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.

sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)

These give 0.0078.

dhyper(x, m, n, k) gives the probability of drawing exactly x. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.

phyper(x, m, n, k) gives the probability of getting x or fewer, so phyper(x, m, n, k) is the same as sum(dhyper(0:x, m, n, k)).

The lower.tail=FALSE is a bit confusing. phyper(x, m, n, k, lower.tail=FALSE) is the same as 1-phyper(x, m, n, k), and so is the probability of x+1 or more. [I never remember this and so always have to double check.]

At that stattrek.com site, you want to look at the last row, "Cumulative Probability: P(X $\ge$ 100)," rather than the first row "Hypergeometric Probability: P(X = 100)."

Any particular number that you draw is going to have small probability (in fact, max(dhyper(0:400, 3000, 12000, 400)) gives $\sim$0.050), and getting 101 or 102 or any larger number is even more interesting than 100. The p-value is the probability, if the null hypothesis were true, of getting a result as interesting as or more interesting than what was observed.

Here's a picture of the hypergeometric distribution in this case. You can see that it's centered at 80 (20% of 400) and that 100 is pretty far out in the right tail.

[Figure: hypergeometric distribution for this example]
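The shape described here can be reproduced with a short sketch along these lines. The urn sizes come from the answer, while the variable names and plotting choices are illustrative assumptions. The plot is only drawn in an interactive session.

```r
m <- 3000    # white balls in the urn (from the answer)
n <- 12000   # black balls in the urn (from the answer)
k <- 400     # sample size (from the answer)

x <- 0:k
probs <- dhyper(x, m, n, k)

center <- k * m / (m + n)        # mean of the distribution: 20% of 400
peak   <- x[which.max(probs)]    # most probable count
p_tail <- sum(probs[x >= 100])   # P(X >= 100), the p-value discussed above

print(c(center = center, peak = peak, p_tail = p_tail))

# Draw the distribution and mark the observed value of 100
if (interactive()) {
  plot(x, probs, type = "h", xlim = c(40, 130),
       xlab = "white balls drawn", ylab = "probability")
  abline(v = 100, lty = 2)
}
```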
