Multivariate hypergeometric distribution in R

hypergeometric-distribution, multivariate analysis, r

Say I have a bag of colored marbles. This bag contains 30 marbles, 2 of which are red, 3 are green and the rest are blue. I am now randomly drawing 5 marbles out of this bag, without replacement. I want to calculate the probability that I will draw at least 1 red and at least 1 green marble.

If I just wanted to calculate the probability for a single class (say one or more red marbles), I could use the upper tail of the hypergeometric cumulative distribution function, in other words calculate 1 minus the chance of not drawing a single red marble. E.g. in R, I would run 1 - phyper(0, 2, 30 - 2, 5).
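As a quick check of that call (a sketch only; the variable names are illustrative and not part of the original question), the same number can be reproduced by direct counting. In phyper(q, m, n, k), m is the number of marbles in the class of interest, n the number of all other marbles, and k the number drawn:

```r
# Sanity check for the single-class case (variable names are illustrative).
total <- 30
red   <- 2
draws <- 5

# Upper tail via the hypergeometric CDF: 1 - P(no red marble in 5 draws)
p_tail <- 1 - phyper(0, red, total - red, draws)

# Same quantity by counting: all 5 marbles must come from the 28 non-red ones
p_count <- 1 - choose(total - red, draws) / choose(total, draws)

print(c(p_tail, p_count))
```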

However, I assume the probability of drawing at least one of each isn't simply (1 - phyper(0, 2, 30 - 2, 5)) * (1 - phyper(0, 3, 30 - 3, 5)), because the two events are not independent.

I believe I may need to use the multivariate hypergeometric distribution for this, but as far as I can see it only gives me the probability p that I will draw NEITHER a red NOR a green marble. If I then calculate 1 - p, I get the probability of drawing at least one red marble, or at least one green marble, or both.

What is the correct way to now calculate the probability of getting both?

Best Answer

If the univariate hypergeometric is your only tool, you have to recast the problem so that each step involves only two classes.

One approach (not the only one):

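Break the total up by the number of non-blue balls drawn. You need at least one red and at least one green, so at least two of the five draws must be non-blue:

Draw 2 non-blue + draw 3 non-blue + draw 4 non-blue + draw 5 non-blue.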

Then work out the probability under each case; e.g. the first one is:

$P(\text{two non-blue balls in 5 draws}) \times$
$\hspace{0.5 cm} P(\text{exactly one red} |$ $\hspace{ 3cm} \text{two balls that are either green or red drawn from the original pool})$

So for the second factor, you're essentially drawing two balls from a pool of 2 red and 3 green and working out the probability of exactly 1 red. The first term is therefore the product of two hypergeometric probabilities.
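A minimal sketch of this first term, assuming the counts from the question (2 red, 3 green, 25 blue, 5 draws); the variable names are illustrative:

```r
# First term only: exactly 2 non-blue balls in 5 draws, one red and one green.
# Counts assumed from the question: 2 red, 3 green, 25 blue.

# P(exactly 2 non-blue in 5 draws): 5 non-blue vs 25 blue in the bag
p_two_nonblue <- dhyper(2, 2 + 3, 25, 5)

# P(exactly 1 red | 2 balls drawn from the 2 red + 3 green)
p_one_red <- dhyper(1, 2, 3, 2)

term_2 <- p_two_nonblue * p_one_red
print(term_2)
```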

The second term would be

$P(\text{three non-blue balls in 5 draws}) \times$
$\hspace{0.5 cm} [P(\text{exactly one red} |$ $\hspace{ 3cm} \text{three balls that are either green or red drawn from the original pool})$ $\hspace{0.3cm}+P(\text{exactly two red} |$ $\hspace{ 3cm} \text{three balls that are either green or red drawn from the original pool})]$

This is the sum of two hypergeometric probabilities, times a hypergeometric probability; however, you can write the sum as the difference of two phyper calls (which doesn't save anything in this term, but will on the remaining ones).
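To illustrate, here is a sketch of the second term computed both ways, again assuming the counts from the question (variable names are illustrative):

```r
# Second term: exactly 3 non-blue balls in 5 draws, with 1 or 2 of them red.
p_three_nonblue <- dhyper(3, 2 + 3, 25, 5)

# Sum of two point probabilities: exactly 1 red or exactly 2 red out of 3
p_red_sum <- dhyper(1, 2, 3, 3) + dhyper(2, 2, 3, 3)

# Same thing as a difference of two CDF values: P(red <= 2) - P(red <= 0)
p_red_diff <- phyper(2, 2, 3, 3) - phyper(0, 2, 3, 3)

term_3 <- p_three_nonblue * p_red_diff
print(c(p_red_sum, p_red_diff, term_3))
```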

So the overall probability is a sum of a vector of terms of the form dhyper(...)*(phyper(...)-phyper(...)), where the dhyper call covers the "draw $i$ non-blue balls" part and the difference of phyper terms covers the range of how many reds are drawn out of $i$. Note that you should be able to do the whole thing with a single call of dhyper and two phyper calls (since you can pass vector arguments).
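One way this could look in R, as a sketch under the question's counts (2 red, 3 green, 25 blue, 5 draws). The choice of bounds is my reading of the description: with $i$ non-blue balls, "at least one of each colour" means between 1 and $i - 1$ reds.

```r
# Vectorised over the number of non-blue balls drawn, i = 2, ..., 5.
i <- 2:5

# P(exactly i non-blue balls in 5 draws)
p_nonblue <- dhyper(i, 2 + 3, 25, 5)

# P(1 <= red <= i - 1 | i balls drawn from 2 red + 3 green),
# i.e. at least one red and at least one green among the i non-blue balls
p_mixed <- phyper(i - 1, 2, 3, i) - phyper(0, 2, 3, i)

p_both <- sum(p_nonblue * p_mixed)
print(p_both)
```

The result can be compared against a direct count of favourable hands, 16626 out of choose(30, 5) = 142506.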


If you do have a multivariate hypergeometric pmf, you should be able to write the probability as a sum of pmf terms over all outcomes with at least one red and at least one green.
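A sketch of that sum, using a hand-written pmf. The helper dmvhyper_sketch is hypothetical (written here for illustration, not taken from any package), and the counts are those from the question:

```r
# Hypothetical helper: multivariate hypergeometric pmf written out by hand.
# x = number drawn from each class, n = number available in each class.
dmvhyper_sketch <- function(x, n) {
  prod(choose(n, x)) / choose(sum(n), sum(x))
}

pool  <- c(red = 2, green = 3, blue = 25)
draws <- 5

# Sum the pmf over every outcome with at least one red and one green
p_both <- 0
for (r in 1:2) {
  for (g in 1:3) {
    b <- draws - r - g            # the remaining draws are blue
    if (b >= 0) {
      p_both <- p_both + dmvhyper_sketch(c(r, g, b), pool)
    }
  }
}
print(p_both)
```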

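You could also approach it by inclusion-exclusion. Writing $\bar{B}$ for the number of non-blue balls drawn and $R$, $G$ for the numbers of red and green balls drawn:

$P(\bar{B}\geq 2) \times [1-P(R=0 \mid \bar{B}\geq 2)-P(G=0 \mid \bar{B}\geq 2)+ P(R=0,G=0 \mid \bar{B}\geq 2)]$

(The last term is zero, since every non-blue ball is either red or green.)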

This, too, will involve a sum of terms, but you can generate these using vector arguments as well.
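A sketch of the inclusion-exclusion version with vector arguments, assuming the question's counts. Multiplying out the conditional probabilities turns the expression into a sum over the number $i$ of non-blue balls; the final cross-check against the unconditional inclusion-exclusion formula is my addition.

```r
# Inclusion-exclusion, summed over i = number of non-blue balls drawn.
i <- 2:5
p_nonblue <- dhyper(i, 2 + 3, 25, 5)   # P(exactly i non-blue in 5 draws)

p_no_red   <- dhyper(0, 2, 3, i)       # all i non-blue balls are green
p_no_green <- dhyper(0, 3, 2, i)       # all i non-blue balls are red
# P(no red and no green | i non-blue) is 0 for i >= 2, so it is omitted.

p_both <- sum(p_nonblue * (1 - p_no_red - p_no_green))

# Cross-check (my addition): unconditional inclusion-exclusion on the full bag
p_check <- 1 - dhyper(0, 2, 28, 5) - dhyper(0, 3, 27, 5) + dhyper(0, 5, 25, 5)

print(c(p_both, p_check))
```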

I may come back and try to make this answer more broadly useful.
