The reference mentions that this identity is from combinatorics: that is, it counts things.
What does it count? Consider $N$ objects. Once and for all, divide those $n$ things into a group of $K$ of them, which I will call "red," and the remainder, which I will call "blue." Each subset of $n$ such objects determines, and is determined by, its red objects and its blue objects. The number of such sets with $k$ red objects (and therefore $n-k$ blue objects) equals the number of ways to choose $k$ red objects from all $K$ ones (written $\color{red}{\binom{K}{k}}$) times the number of ways to choose the remaining $n-k$ blue objects from all the $N-K$ ones (written $\color{blue}{\binom{N-K}{n-k}}$).
Now if $k$ is not between $0$ and $K$, then there is no $k$-element subset of $K$ things, so $\binom{K}{k}=0$ in such cases. Similarly, $\binom{N-K}{n-k}=0$ if $n-k$ is not between $0$ and $N-K$. (This not only makes sense, it is actually how good software will evaluate these quantities. Ask R
, for instance, to compute choose(5,6)
or choose(5,-1)
: it will return the correct value of $0$ in both cases.)
Summing over all possible numbers $k$ shows that
$$\binom{N}{n} = \sum_k \color{red}{\binom{K}{k}}\color{blue}{\binom{N-K}{n-k}}$$
and as you read this you should say to yourself "any $n$ objects are comprised of some number $k$ of red objects and the remaining $n-k$ blue objects."
The sum needs to include all $k$ for which both the terms $\color{red}{\binom{K}{k}}$ and $\color{blue}{\binom{N-K}{n-k}}$ are nonzero, but it's fine to include any other values of $k$ because they will just introduce some extra zeros into the sum, which does not change it. We just need to make sure all relevant $k$ are included. It suffices to find an obvious lower bound for it ($0$ will do nicely and is more practicable than $-\infty$!) and an obvious upper bound ($N$ works because we cannot find more than $N$ objects altogether). A slightly better upper bound is $n$ (because $k$ counts the red objects in a set of $n$ things). Thus, writing these bounds explicitly and dividing both sides by $\binom{N}{n}$, we obtain
$$1 = \sum_{0\le k\le n}\frac{\color{red}{\binom{K}{k}}\color{blue}{\binom{N-K}{n-k}}}{\binom{N}{n}} .$$
Despite the notation, this formula does not implicitly assert that all values of $k$ in the range from $0$ to $n$ can occur in this distribution. About the only reason to fiddle with the inequalities and figure out what the smallest possible range of $k$ can be would be for writing programs that loop over these values: that might save a little time adding up some zeros.
If the univariate hypergeometric is your only tool you have to get it into something where you have two classes.
One approach (not the only one):
Break the total up as follows --
Draw 2 non-blue + draw 3 non-blue + ... + draw 5 non-blue.
Then work out the probability under each case; e.g. the first one is:
$P(\text{two non-blue balls in 5 draws}) \times$
$\hspace{0.5 cm} P(\text{exactly one red} |$ $\hspace{ 3cm} \text{two balls that are either green or red drawn from the original pool})$
So for the second part, you're essentially drawing two balls from (2 red, 3 green) and working out the probability of exactly 1 red. So it should be the product of two hypergeometric probabilities.
The second term would be
$P(\text{three non-blue balls in 5 draws}) \times$
$\hspace{0.5 cm} [P(\text{exactly one red} |$ $\hspace{ 3cm} \text{three balls that are either green or red drawn from the original pool})$ $\hspace{0.3cm}+P(\text{exactly two red} |$ $\hspace{ 3cm} \text{three balls that are either green or red drawn from the original pool})]$
This is the sum of two hypergeometric probabilities, times a hypergeometric probability; however, you can write it as the difference of two phyper
calls (which doesn't save anything in this term, but will on the remaining ones.
So the overall thing would be a sum of a vector of terms of the form dhyper(...)*(phyper(...)-phyper(...))
. Note that you should be able to do the whole thing with a single call of dhyper
and two phyper
calls (since you can pass vector arguments).
where the dhyper
call covers the "draw $i$ non-blue balls" and the difference of phyper
terms covers the range of how many reds are drawn out of $i$.
If you do have a multivariate hypergeometric pmf, you should be able to write it as a sum of terms.
You could also approach it in terms of
$P(\bar{B}\geq 2)*
[1-P(R=0|\bar{B}\geq 2)-P(G=0|\bar{B}\geq 2)+
P(R=0,G=0|\bar{B}\geq 2)]$
This, too, will involve a sum of terms, but you can generate these using vector arguments as well.
I may come back and try to make this answer more broadly useful.
Best Answer
Unless you need to evaluate the Gauss hypergeometric function for complex values of the parameters or the variable, it is better to use Robin Hankin's
gsl
package.Based on my experience I also recommend to only evaluate the Gauss hypergeometric function for a value of the variable lying in $[0,1]$, and use a transformation formula for values in $]-\infty, 0]$.
Update
Here is my alternative implementation with the gmp package (at least, for fun)