You are right that there are 7 more extreme tables; they are listed below along with the original one. The p-value for the one-tailed Fisher's exact test is the sum of the hypergeometric probabilities of all these tables. Depending on the odds ratio (a*d)/(b*c): if a*d <= b*c, you should use the "less" alternative hypothesis, otherwise the "greater" one. Fisher's exact test assumes that the marginal totals are fixed, so the whole table is determined by a single cell. If a*d <= b*c, the number of more extreme tables is n = min(a, d) (in your example n = a, and a ranges over 0, 1, ..., 6 in the more extreme tables); otherwise n = min(b, c).
------------------- ------------------- ------------------- -------------------
|a=7 |b=10 | |a=6 |b=11 | |a=5 |b=12 | |a=4 |b=13 |
------------------- ------------------- ------------------- -------------------
|c=17794 |d=1107 | |c=17795 |d=1106 | |c=17796 |d=1105 | |c=17797 |d=1104 |
------------------- ------------------- ------------------- -------------------
------------------- ------------------- ------------------- -------------------
|a=3 |b=14 | |a=2 |b=15 | |a=1 |b=16 | |a=0 |b=17 |
------------------- ------------------- ------------------- -------------------
|c=17798 |d=1103 | |c=17799 |d=1102 | |c=17800 |d=1101 | |c=17801 |d=1100 |
------------------- ------------------- ------------------- -------------------
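As a numerical check on the enumeration above, here is a sketch in Python, with `scipy.stats.hypergeom` and `fisher_exact` standing in for R's `dhyper` and `fisher.test`: summing the hypergeometric probabilities of the original table and the 7 more extreme ones reproduces the one-tailed p-value.

```python
from scipy.stats import hypergeom, fisher_exact

# Margins of the tables above: row totals 17 and 18901, column totals
# 17801 and 1117, grand total 18918.  With the margins fixed, the table
# is determined by the single cell a, which is hypergeometric: a
# "column 1" items when drawing 17 from 17801 + 1117 = 18918.
N, K, n = 18918, 17801, 17   # population, successes in population, draws

# Sum the probabilities of a = 0, 1, ..., 7 (the original table plus
# the 7 more extreme ones listed above).
p_manual = sum(hypergeom.pmf(a, N, K, n) for a in range(8))

# Same thing via the built-in one-tailed Fisher's exact test; since
# a*d <= b*c here, the alternative is "less".
_, p_fisher = fisher_exact([[7, 10], [17794, 1107]], alternative="less")

print(p_manual, p_fisher)   # the two agree
```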
If the univariate hypergeometric distribution is the only tool you have, you must recast the problem so that there are only two classes.
One approach (not the only one):
Break the total up as follows --
Draw 2 non-blue + draw 3 non-blue + ... + draw 5 non-blue.
Then work out the probability under each case; e.g. the first one is:
$$P(\text{two non-blue balls in 5 draws}) \times P(\text{exactly one red} \mid \text{two balls that are either green or red drawn from the original pool})$$
So for the second part, you're essentially drawing two balls from (2 red, 3 green) and working out the probability of exactly 1 red. So it should be the product of two hypergeometric probabilities.
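For concreteness, that first term is a product of two hypergeometric probabilities; a Python sketch with `scipy.stats.hypergeom` in place of R's `dhyper`. The question does not state the number of blue balls, so 5 blue balls is an assumed count for illustration only.

```python
from scipy.stats import hypergeom

# Assumed pool for illustration: 2 red, 3 green, 5 blue; draw 5 balls.
n_red, n_green, n_blue, n_draw = 2, 3, 5, 5
n_nonblue = n_red + n_green          # 5 balls that are red or green
n_total = n_nonblue + n_blue         # 10 balls in all

# P(two non-blue balls in 5 draws): hypergeometric with 5 "successes"
# (non-blue) in a population of 10, drawing 5.
p_two_nonblue = hypergeom.pmf(2, n_total, n_nonblue, n_draw)

# P(exactly one red | two non-blue drawn): drawing 2 from (2 red, 3 green).
p_one_red = hypergeom.pmf(1, n_nonblue, n_red, 2)

term1 = p_two_nonblue * p_one_red
print(term1)   # first term of the sum, with these assumed counts
```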
The second term would be
$$P(\text{three non-blue balls in 5 draws}) \times \big[P(\text{exactly one red} \mid \text{three balls that are either green or red drawn from the original pool}) + P(\text{exactly two red} \mid \text{three balls that are either green or red drawn from the original pool})\big]$$
This is the sum of two hypergeometric probabilities, times a hypergeometric probability; however, you can write the bracketed sum as the difference of two phyper calls (which doesn't save anything in this term, but will on the remaining ones).
So the overall answer is the sum of a vector of terms of the form dhyper(...)*(phyper(...) - phyper(...)). Note that you should be able to do the whole thing with a single call to dhyper and two calls to phyper (since you can pass vector arguments), where the dhyper call covers "draw $i$ non-blue balls" and the difference of phyper terms covers the range of how many reds are drawn out of those $i$.
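A sketch of that vectorized sum in Python, with `scipy.stats.hypergeom.pmf` and `.cdf` playing the roles of R's `dhyper` and `phyper`; as above, 5 blue balls is an assumed count, since the question leaves it unspecified.

```python
import numpy as np
from scipy.stats import hypergeom

# Assumed pool for illustration: 2 red, 3 green, 5 blue; draw 5 balls.
n_red, n_green, n_blue, n_draw = 2, 3, 5, 5
n_nonblue = n_red + n_green
n_total = n_nonblue + n_blue

# i = number of non-blue balls among the 5 draws; at least one red
# and at least one green requires i >= 2.
i = np.arange(2, n_draw + 1)

# "draw i non-blue balls in 5 draws": one vectorized dhyper-style call.
p_nonblue = hypergeom.pmf(i, n_total, n_nonblue, n_draw)

# Given i non-blue balls, they are a draw of size i from (2 red, 3 green);
# "at least one red and at least one green" means 1 <= reds <= i - 1,
# i.e. a difference of two phyper-style (CDF) calls.
p_mix = (hypergeom.cdf(i - 1, n_nonblue, n_red, i)
         - hypergeom.cdf(0, n_nonblue, n_red, i))

p = np.sum(p_nonblue * p_mix)
print(p)   # P(at least one red and at least one green), assumed counts
```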
If you do have a multivariate hypergeometric pmf, you should be able to write it as a sum of terms.
You could also approach it in terms of
$$P(\bar{B}\geq 2)\,\big[1-P(R=0\mid\bar{B}\geq 2)-P(G=0\mid\bar{B}\geq 2)+P(R=0,G=0\mid\bar{B}\geq 2)\big]$$
This, too, will involve a sum of terms, but you can generate these using vector arguments as well.
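The same value falls out of this route (a Python sketch with the same assumed pool of 2 red, 3 green, 5 blue and 5 draws). Since the event "at least one red and at least one green" already implies $\bar{B}\geq 2$, the unconditional inclusion-exclusion $1 - P(R=0) - P(G=0) + P(R=0,G=0)$ gives the same answer and is slightly simpler to code:

```python
from math import comb

# Assumed pool for illustration: 2 red, 3 green, 5 blue; draw 5 balls.
n_red, n_green, n_blue, n_draw = 2, 3, 5, 5
n_total = n_red + n_green + n_blue

def p_avoid(*excluded):
    """P(none of the excluded colours appear among the n_draw draws)."""
    avail = n_total - sum(excluded)
    return comb(avail, n_draw) / comb(n_total, n_draw)

# Inclusion-exclusion:
# P(R >= 1 and G >= 1) = 1 - P(R = 0) - P(G = 0) + P(R = 0 and G = 0).
p = 1 - p_avoid(n_red) - p_avoid(n_green) + p_avoid(n_red, n_green)
print(p)   # matches the dhyper/phyper decomposition described above
```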
I may come back and try to make this answer more broadly useful.
Best Answer
The reference mentions that this identity is from combinatorics: that is, it counts things.
What does it count? Consider $N$ objects. Once and for all, divide those $N$ objects into a group of $K$ of them, which I will call "red," and the remainder, which I will call "blue." Each subset of $n$ such objects determines, and is determined by, its red objects and its blue objects. The number of such sets with $k$ red objects (and therefore $n-k$ blue objects) equals the number of ways to choose $k$ red objects from all $K$ ones (written $\color{red}{\binom{K}{k}}$) times the number of ways to choose the remaining $n-k$ blue objects from all the $N-K$ ones (written $\color{blue}{\binom{N-K}{n-k}}$).
Now if $k$ is not between $0$ and $K$, then there is no $k$-element subset of $K$ things, so $\binom{K}{k}=0$ in such cases. Similarly, $\binom{N-K}{n-k}=0$ if $n-k$ is not between $0$ and $N-K$. (This not only makes sense, it is actually how good software will evaluate these quantities. Ask R, for instance, to compute choose(5,6) or choose(5,-1): it will return the correct value of $0$ in both cases.)
Summing over all possible numbers $k$ shows that
$$\binom{N}{n} = \sum_k \color{red}{\binom{K}{k}}\color{blue}{\binom{N-K}{n-k}}$$
and as you read this you should say to yourself "any $n$ objects are comprised of some number $k$ of red objects and the remaining $n-k$ blue objects."
The sum needs to include all $k$ for which both the terms $\color{red}{\binom{K}{k}}$ and $\color{blue}{\binom{N-K}{n-k}}$ are nonzero, but it's fine to include any other values of $k$ because they will just introduce some extra zeros into the sum, which does not change it. We just need to make sure all relevant $k$ are included. It suffices to find an obvious lower bound for it ($0$ will do nicely and is more practicable than $-\infty$!) and an obvious upper bound ($N$ works because we cannot find more than $N$ objects altogether). A slightly better upper bound is $n$ (because $k$ counts the red objects in a set of $n$ things). Thus, writing these bounds explicitly and dividing both sides by $\binom{N}{n}$, we obtain
$$1 = \sum_{0\le k\le n}\frac{\color{red}{\binom{K}{k}}\color{blue}{\binom{N-K}{n-k}}}{\binom{N}{n}} .$$
Despite the notation, this formula does not implicitly assert that all values of $k$ in the range from $0$ to $n$ can occur in this distribution. About the only reason to fiddle with the inequalities and figure out what the smallest possible range of $k$ can be would be for writing programs that loop over these values: that might save a little time adding up some zeros.
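A quick sketch in Python makes both points concrete: `math.comb` follows the same convention of returning $0$ when $k$ is out of range, so summing over the generous range $0 \le k \le n$ just adds harmless zeros.

```python
from math import comb

# math.comb, like R's choose, returns 0 when k > n (though unlike R it
# raises an error on negative arguments rather than returning 0).
print(comb(5, 6))   # 0

# Vandermonde's identity: C(N, n) = sum over k of C(K, k) * C(N-K, n-k).
N, K, n = 10, 4, 5
lhs = comb(N, n)
rhs = sum(comb(K, k) * comb(N - K, n - k) for k in range(n + 1))
print(lhs, rhs)     # 252 252 -- the terms with k > K contribute 0

# Dividing through by C(N, n): the hypergeometric probabilities sum to 1.
total = rhs / lhs
print(total)        # 1.0
```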