[Math] Multivariate Hypergeometric Distribution/Urn Problem

combinatoricsprobabilityprobability distributions

I am having a difficulty with the following multivariate hypergeometric distribution problem. The setting is as usual, an urn contains a total of $M$ balls of $K$ unique colors, with $N_1$ balls of color 1, $N_2$ balls of color 2, …, $N_K$ balls of color $K$ s.t. $N_1+N_2…+N_K = M$. What is the probability that in a sample of size $n$ (without replacement), the ball drawn last has a color not sampled before. For simplicity, we can assume that $N_1=N_2=…=N_K=N$, i.e. $M=KN$.

I have been trying to look at particular cases with $K=2$ and $K=3$ (2 or 3 colors) with different values of sample size, $n$, hoping I could generalize the formulas for arbitrary $K$ and $n$. Thus, for example, for $K=2$ and any value of $n$, I showed that the probability in question could be found by $K \cdot \frac{{N_1 \choose n-1}{N_2 \choose 1}}{n {M \choose n}}$. For $K=3$, we may have two different cases: a) only 2 out of the available three colors are sampled (with $n-1$ balls of the same color and 1 ball of a second color). The desired probability is then $K(K-1)\frac{{N_1 \choose 2}{N_2 \choose 0}{N_3 \choose 1}}{n {M \choose 3}}$. And case b) all three colors are sampled (1 ball of each), then the desired probability is $K(K-1) \frac{{N_1 \choose 1}{N_2 \choose 1}{N_3 \choose 1}}{n{M \choose 3}}$ and the final answer is the sum of (a) and (b).

Does this logic seem reasonable ? Obviously by increasing the values of $k$ and $n$ the number of cases to keep track of will increase too but it seems that each new case may be simplified into (or represented by) a previously worked out scenario. In short, it seems that I may eventually be able to find some recursive relation but after some tedious work. Any ideas would be greatly appreciated. More specifically – is this a good route to go? If yes, are there any shortcuts that I can take ? Is there a completely different approach that I can try ?
Thanks, in advance,
Tamar

Best Answer

One can find, for the general case, an explicit formula for the required probability. We show, for example, how to compute the probability that the last ball drawn is blue and no blue has been drawn in the first $n-1$ draws. For the answer, then add over all colours.

Suppose that $b$ of the $M$ balls are blue. Imagine that all the balls are distinguishable, say via engraved ID numbers.

There are $$M(M-1)\cdots (M-n+1)$$ sequences of $n$ balls, all equally likely.

There are $(M-b)(M-b-1)\cdots (M-b-(n-1)+1)$ sequences of $n-1$ non-blue balls, and therefore
$$(M-b)(M-b-1)\cdots (M-b-(n-1)+1)(b)$$ sequences of $n-1$ non-blues followed by a blue. Divide. For a more closed-looking form, the above products can be expressed using binomial coefficients and factorials.

Related Question