Solved – How the hypergeometric distribution sums to 1

The hypergeometric distribution is defined for $\max(0, n+K-N)\leq k\leq \min(K,n).$
But when we use Vandermonde's identity to prove that the probabilities sum to $1$, we sum over the range $0\leq k \leq n.$ How is this justified?

Best Answer

The reference mentions that this identity is from combinatorics: that is, it counts things.

What does it count? Consider $N$ objects. Once and for all, divide those $N$ things into a group of $K$ of them, which I will call "red," and the remainder, which I will call "blue." Each subset of $n$ such objects determines, and is determined by, its red objects and its blue objects. The number of such sets with $k$ red objects (and therefore $n-k$ blue objects) equals the number of ways to choose $k$ red objects from all $K$ ones (written $\color{red}{\binom{K}{k}}$) times the number of ways to choose the remaining $n-k$ blue objects from all the $N-K$ ones (written $\color{blue}{\binom{N-K}{n-k}}$).

Now if $k$ is not between $0$ and $K$, then there is no $k$-element subset of $K$ things, so $\binom{K}{k}=0$ in such cases. Similarly, $\binom{N-K}{n-k}=0$ if $n-k$ is not between $0$ and $N-K$. (This not only makes sense, it is actually how good software will evaluate these quantities. Ask R, for instance, to compute choose(5,6) or choose(5,-1): it will return the correct value of $0$ in both cases.)
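The same convention can be sketched outside R. The helper `choose` below is a hypothetical Python stand-in written for illustration, not R's implementation: Python's `math.comb` already returns $0$ when $k$ exceeds the first argument, but it raises an error for negative $k$, so that case is handled by hand.

```python
from math import comb


def choose(n, k):
    """Number of k-element subsets of an n-element set; 0 when k is out of range."""
    # math.comb returns 0 for k > n but raises ValueError for negative k,
    # so the negative case is mapped to 0 explicitly.
    if k < 0:
        return 0
    return comb(n, k)


print(choose(5, 6))   # 0: there is no 6-element subset of 5 things
print(choose(5, -1))  # 0: no subset has a negative size
print(choose(5, 2))   # 10: an ordinary in-range value
```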

Summing over all possible numbers $k$ shows that

$$\binom{N}{n} = \sum_k \color{red}{\binom{K}{k}}\color{blue}{\binom{N-K}{n-k}}$$

and as you read this you should say to yourself "any $n$ objects consist of some number $k$ of red objects and the remaining $n-k$ blue objects."

The sum needs to include all $k$ for which both the terms $\color{red}{\binom{K}{k}}$ and $\color{blue}{\binom{N-K}{n-k}}$ are nonzero, but it's fine to include any other values of $k$ because they will just introduce some extra zeros into the sum, which does not change it. We just need to make sure all relevant $k$ are included. It suffices to find an obvious lower bound for it ($0$ will do nicely and is more practicable than $-\infty$!) and an obvious upper bound ($N$ works because we cannot find more than $N$ objects altogether). A slightly better upper bound is $n$ (because $k$ counts the red objects in a set of $n$ things). Thus, writing these bounds explicitly and dividing both sides by $\binom{N}{n}$, we obtain

$$1 = \sum_{0\le k\le n}\frac{\color{red}{\binom{K}{k}}\color{blue}{\binom{N-K}{n-k}}}{\binom{N}{n}} .$$

Despite the notation, this formula does not implicitly assert that all values of $k$ in the range from $0$ to $n$ can occur in this distribution. About the only reason to fiddle with the inequalities and figure out what the smallest possible range of $k$ can be would be for writing programs that loop over these values: that might save a little time adding up some zeros.
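A small numerical check may make this concrete. The sketch below uses exact rational arithmetic and illustrative values $N=10$, $K=7$, $n=5$ (chosen here, not taken from the question); `choose` and `hyper_pmf` are hypothetical helper names. It sums the probabilities over $0\le k\le n$ and over the tight range $\max(0,n+K-N)\le k\le\min(K,n)$ and confirms that the extra terms are all zero.

```python
from fractions import Fraction
from math import comb


def choose(n, k):
    """Binomial coefficient that is 0 whenever k is outside 0..n."""
    return comb(n, k) if k >= 0 else 0


def hyper_pmf(k, N, K, n):
    """Hypergeometric probability of k red objects in a sample of n."""
    return Fraction(choose(K, k) * choose(N - K, n - k), comb(N, n))


N, K, n = 10, 7, 5                      # illustrative values (an assumption)
lo, hi = max(0, n + K - N), min(K, n)   # smallest range of possible k

# Sum over the convenient range 0..n and over the tight range lo..hi.
full = sum(hyper_pmf(k, N, K, n) for k in range(0, n + 1))
tight = sum(hyper_pmf(k, N, K, n) for k in range(lo, hi + 1))

# Values of k that the convenient range includes but that cannot occur.
extra = [k for k in range(0, n + 1) if not lo <= k <= hi]

print(full, tight, extra)
```

Both sums agree because every $k$ in `extra` contributes a term that is exactly zero.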

Related Solutions

Solved – One tailed Fisher’s exact test and the hypergeometric distribution

You are right that there are 7 more extreme tables; they are listed below together with the original one. The p-value for the one-tailed Fisher's exact test is the sum of the hypergeometric probabilities of all eight tables.

Which alternative to use depends on the odds ratio (a*d)/(b*c): if a*d <= b*c, use the "less" alternative hypothesis; otherwise use "greater". Fisher's exact test assumes that the marginal totals are fixed, so the whole table is determined by one cell. If a*d <= b*c, the number of more extreme tables is n=min(a,d) (in your example n=a=7, with a taking the values 0, 1, ..., 6 in the more extreme tables); otherwise it is n=min(b,c).

-------------------  -------------------  -------------------  -------------------
|a=7     |b=10    |  |a=6     |b=11    |  |a=5     |b=12    |  |a=4     |b=13    |
-------------------  -------------------  -------------------  -------------------
|c=17794 |d=1107  |  |c=17795 |d=1106  |  |c=17796 |d=1105  |  |c=17797 |d=1104  |
-------------------  -------------------  -------------------  -------------------
-------------------  -------------------  -------------------  -------------------
|a=3     |b=14    |  |a=2     |b=15    |  |a=1     |b=16    |  |a=0     |b=17    |
-------------------  -------------------  -------------------  -------------------
|c=17798 |d=1103  |  |c=17799 |d=1102  |  |c=17800 |d=1101  |  |c=17801 |d=1100  |
-------------------  -------------------  -------------------  -------------------
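As a rough sketch of that sum, the Python below adds up the hypergeometric probabilities of the eight tables above, holding the margins fixed and letting the top-left cell run from 0 to 7. The names `table_prob` and `p_less` are hypothetical, and the arithmetic is done with exact fractions before converting to a float; in R one would normally call `fisher.test` or `phyper` instead.

```python
from fractions import Fraction
from math import comb

# Observed table from the example.
a, b, c, d = 7, 10, 17794, 1107

row1 = a + b            # first row total, fixed
col1 = a + c            # first column total, fixed
col2 = b + d            # second column total, fixed
total = a + b + c + d   # grand total


def table_prob(x):
    """Hypergeometric probability of the table whose top-left cell equals x."""
    return Fraction(comb(col1, x) * comb(col2, row1 - x), comb(total, row1))


# Here a*d <= b*c, so the "less" alternative applies: sum over x = 0..a.
terms = [table_prob(x) for x in range(0, a + 1)]
p_less = float(sum(terms))
print(p_less)
```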

Solved – Multivariate hypergeometric distribution in R

If the univariate hypergeometric is your only tool, you have to recast the problem so that there are only two classes.

One approach (not the only one):

Break the total up as follows --

Draw 2 non-blue + draw 3 non-blue + ... + draw 5 non-blue.

Then work out the probability under each case; e.g. the first one is:

$$\begin{aligned}
&P(\text{two non-blue balls in 5 draws}) \times{} \\
&\quad P(\text{exactly one red} \mid \text{two balls that are either green or red drawn from the original pool})
\end{aligned}$$

So for the second part, you're essentially drawing two balls from (2 red, 3 green) and working out the probability of exactly 1 red. So it should be the product of two hypergeometric probabilities.

The second term would be

$$\begin{aligned}
&P(\text{three non-blue balls in 5 draws}) \times{} \\
&\quad \big[P(\text{exactly one red} \mid \text{three balls that are either green or red drawn from the original pool}) \\
&\quad {}+ P(\text{exactly two red} \mid \text{three balls that are either green or red drawn from the original pool})\big]
\end{aligned}$$

This is the sum of two hypergeometric probabilities, times a hypergeometric probability; however, you can write it as the difference of two phyper calls (which doesn't save anything in this term, but will on the remaining ones).

So the overall thing would be a sum of a vector of terms of the form dhyper(...)*(phyper(...)-phyper(...)), where the dhyper call covers the "draw $i$ non-blue balls" part and the difference of phyper terms covers the range of how many reds are drawn out of $i$. Note that you should be able to do the whole thing with a single call of dhyper and two phyper calls (since you can pass vector arguments).
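The decomposition can be sketched as follows. This is a Python illustration, not R: `dhyper` and `phyper` here are hypothetical re-implementations that follow R's argument order (successes in the urn, failures in the urn, number drawn), and the count of 4 blue balls is an assumption, since the original pool size is not restated here. The event is taken to be "at least one red and at least one green in 5 draws," with 2 red and 3 green balls as in the example.

```python
from fractions import Fraction
from math import comb


def choose(n, k):
    """Binomial coefficient that is 0 whenever k is outside 0..n."""
    return comb(n, k) if k >= 0 else 0


def dhyper(x, m, n, k):
    """P(exactly x successes) drawing k from m successes and n failures."""
    return Fraction(choose(m, x) * choose(n, k - x), comb(m + n, k))


def phyper(q, m, n, k):
    """P(at most q successes), same arguments as dhyper."""
    return sum((dhyper(x, m, n, k) for x in range(0, q + 1)), Fraction(0))


RED, GREEN = 2, 3   # from the example
BLUE = 4            # assumption: the number of blue balls is not given here
DRAWS = 5

prob = Fraction(0)
for i in range(2, DRAWS + 1):
    # Probability that exactly i of the 5 draws are non-blue.
    p_nonblue = dhyper(i, RED + GREEN, BLUE, DRAWS)
    # Given i non-blue balls, between 1 and i - 1 of them must be red,
    # which leaves at least one green: P(R <= i - 1) - P(R <= 0).
    p_reds = phyper(i - 1, RED, GREEN, i) - phyper(0, RED, GREEN, i)
    prob += p_nonblue * p_reds

print(prob)
```

In R the loop would collapse into vectorized calls, as the answer describes; the explicit loop is used here only to keep each term visible.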

If you do have a multivariate hypergeometric pmf, you should be able to write it as a sum of terms.

You could also approach it in terms of

$P(\bar{B}\geq 2)* [1-P(R=0|\bar{B}\geq 2)-P(G=0|\bar{B}\geq 2)+ P(R=0,G=0|\bar{B}\geq 2)]$

This, too, will involve a sum of terms, but you can generate these using vector arguments as well.
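A minimal sketch of this inclusion-exclusion form, again in Python and again assuming a pool of 2 red, 3 green and a hypothetical 4 blue balls with 5 draws. The helper names `pmf`, `prob` and `cond` are illustrative; the probabilities are obtained by enumerating the multivariate hypergeometric pmf rather than by vectorized calls.

```python
from fractions import Fraction
from itertools import product
from math import comb

RED, GREEN, DRAWS = 2, 3, 5
BLUE = 4  # assumption: the number of blue balls is not given here


def pmf(r, g, b):
    """Multivariate hypergeometric probability of r red, g green, b blue."""
    return Fraction(comb(RED, r) * comb(GREEN, g) * comb(BLUE, b),
                    comb(RED + GREEN + BLUE, DRAWS))


# All feasible (red, green, blue) counts that add up to the number of draws.
outcomes = [(r, g, DRAWS - r - g)
            for r, g in product(range(RED + 1), range(GREEN + 1))
            if 0 <= DRAWS - r - g <= BLUE]


def prob(event):
    """Probability of an event given as a predicate on (r, g, b)."""
    return sum((pmf(*o) for o in outcomes if event(*o)), Fraction(0))


# P(at least two non-blue balls).
p_nb = prob(lambda r, g, b: r + g >= 2)


def cond(event):
    """Probability of an event conditional on at least two non-blue balls."""
    return prob(lambda r, g, b: r + g >= 2 and event(r, g, b)) / p_nb


answer = p_nb * (1
                 - cond(lambda r, g, b: r == 0)
                 - cond(lambda r, g, b: g == 0)
                 + cond(lambda r, g, b: r == 0 and g == 0))
print(answer)
```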

I may come back and try to make this answer more broadly useful.
