Solved – How the hypergeometric distribution sums to 1

hypergeometric-distribution

The hypergeometric distribution is defined for $\max(0, n+K-N)\leq k\leq \min(K,n).$
But, when we use Vandermonde's identity to prove that probabilities sum to $1$, then we use the range of $0\leq k \leq n.$ I wonder how this is justified?

Best Answer

The reference mentions that this identity is from combinatorics: that is, it counts things.

What does it count? Consider $N$ objects. Once and for all, divide those $n$ things into a group of $K$ of them, which I will call "red," and the remainder, which I will call "blue." Each subset of $n$ such objects determines, and is determined by, its red objects and its blue objects. The number of such sets with $k$ red objects (and therefore $n-k$ blue objects) equals the number of ways to choose $k$ red objects from all $K$ ones (written $\color{red}{\binom{K}{k}}$) times the number of ways to choose the remaining $n-k$ blue objects from all the $N-K$ ones (written $\color{blue}{\binom{N-K}{n-k}}$).

Now if $k$ is not between $0$ and $K$, then there is no $k$-element subset of $K$ things, so $\binom{K}{k}=0$ in such cases. Similarly, $\binom{N-K}{n-k}=0$ if $n-k$ is not between $0$ and $N-K$. (This not only makes sense, it is actually how good software will evaluate these quantities. Ask R, for instance, to compute choose(5,6) or choose(5,-1): it will return the correct value of $0$ in both cases.)

Summing over all possible numbers $k$ shows that

$$\binom{N}{n} = \sum_k \color{red}{\binom{K}{k}}\color{blue}{\binom{N-K}{n-k}}$$

and as you read this you should say to yourself "any $n$ objects are comprised of some number $k$ of red objects and the remaining $n-k$ blue objects."

The sum needs to include all $k$ for which both the terms $\color{red}{\binom{K}{k}}$ and $\color{blue}{\binom{N-K}{n-k}}$ are nonzero, but it's fine to include any other values of $k$ because they will just introduce some extra zeros into the sum, which does not change it. We just need to make sure all relevant $k$ are included. It suffices to find an obvious lower bound for it ($0$ will do nicely and is more practicable than $-\infty$!) and an obvious upper bound ($N$ works because we cannot find more than $N$ objects altogether). A slightly better upper bound is $n$ (because $k$ counts the red objects in a set of $n$ things). Thus, writing these bounds explicitly and dividing both sides by $\binom{N}{n}$, we obtain

$$1 = \sum_{0\le k\le n}\frac{\color{red}{\binom{K}{k}}\color{blue}{\binom{N-K}{n-k}}}{\binom{N}{n}} .$$

Despite the notation, this formula does not implicitly assert that all values of $k$ in the range from $0$ to $n$ can occur in this distribution. About the only reason to fiddle with the inequalities and figure out what the smallest possible range of $k$ can be would be for writing programs that loop over these values: that might save a little time adding up some zeros.