[Math] Multivariate Hypergeometric Distribution/Urn Problem

combinatorics, probability, probability distributions

I am having difficulty with the following multivariate hypergeometric distribution problem. The setting is the usual one: an urn contains a total of $M$ balls of $K$ distinct colors, with $N_1$ balls of color 1, $N_2$ balls of color 2, …, $N_K$ balls of color $K$, such that $N_1+N_2+\cdots+N_K = M$. What is the probability that, in a sample of size $n$ drawn without replacement, the last ball drawn has a color not sampled before? For simplicity, we can assume that $N_1=N_2=\cdots=N_K=N$, i.e. $M=KN$.

I have been looking at particular cases with $K=2$ and $K=3$ (2 or 3 colors) and different sample sizes $n$, hoping to generalize the formulas to arbitrary $K$ and $n$. For example, for $K=2$ and any value of $n$, I showed that the probability in question is $K \cdot \frac{{N_1 \choose n-1}{N_2 \choose 1}}{n {M \choose n}}$. For $K=3$ (written out here for $n=3$) there are two different cases. (a) Only 2 of the three available colors are sampled (with $n-1$ balls of the same color and 1 ball of a second color). The desired probability is then $K(K-1)\frac{{N_1 \choose 2}{N_2 \choose 0}{N_3 \choose 1}}{n {M \choose 3}}$. (b) All three colors are sampled (1 ball of each). The desired probability is then $K(K-1) \frac{{N_1 \choose 1}{N_2 \choose 1}{N_3 \choose 1}}{n{M \choose 3}}$. The final answer is the sum of (a) and (b).

Does this logic seem reasonable? Obviously, as $K$ and $n$ increase, the number of cases to keep track of increases too, but it seems that each new case can be reduced to (or represented by) a previously worked-out scenario. In short, it seems that I may eventually be able to find a recursive relation, but only after some tedious work. Any ideas would be greatly appreciated. More specifically: is this a good route to take? If so, are there any shortcuts I can take? Is there a completely different approach that I could try?
Thanks in advance,
Tamar

Best Answer

One can find, for the general case, an explicit formula for the required probability. We show, for example, how to compute the probability that the last ball drawn is blue and no blue has been drawn in the first $n-1$ draws. To get the answer, sum this probability over all colours.

Suppose that $b$ of the $M$ balls are blue. Imagine that all the balls are distinguishable, say via engraved ID numbers.

There are $$M(M-1)\cdots (M-n+1)$$ sequences of $n$ balls, all equally likely.

There are $(M-b)(M-b-1)\cdots (M-b-n+2)$ sequences of $n-1$ non-blue balls, and therefore
$$(M-b)(M-b-1)\cdots (M-b-n+2)\,b$$ sequences of $n-1$ non-blues followed by a blue. Dividing this count by the total number of sequences gives the probability for blue. For a more closed-looking form, the above products can be expressed using binomial coefficients and factorials.
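One way to write the closed form mentioned above, and to check it numerically, is sketched here. With $b$ balls of the colour under consideration, the ratio of the two counts simplifies to $$\frac{b}{M-n+1}\cdot\frac{\binom{M-b}{n-1}}{\binom{M}{n-1}},$$ and summing this over the colours gives the required probability. The function names `p_last_is_new` and `p_last_is_new_bruteforce` are illustrative and not part of the original answer; the second one enumerates every ordered sample of distinguishable balls, so it is only practical for small urns.

```python
from fractions import Fraction
from itertools import permutations
from math import comb


def p_last_is_new(counts, n):
    """Exact probability that draw n shows a colour absent from draws 1..n-1.

    counts[i] is the number of balls of colour i; n must not exceed sum(counts).
    """
    M = sum(counts)
    total = Fraction(0)
    for b in counts:
        # Per-colour term: b / (M - n + 1) * C(M - b, n - 1) / C(M, n - 1)
        total += Fraction(b * comb(M - b, n - 1), (M - n + 1) * comb(M, n - 1))
    return total


def p_last_is_new_bruteforce(counts, n):
    """Same probability, by listing all ordered samples of distinguishable balls."""
    # One list entry per physical ball, labelled by its colour index.
    balls = [colour for colour, c in enumerate(counts) for _ in range(c)]
    hits = trials = 0
    for seq in permutations(balls, n):  # positions are distinct, so balls are too
        trials += 1
        hits += seq[-1] not in seq[:-1]
    return Fraction(hits, trials)


if __name__ == "__main__":
    print(p_last_is_new((2, 2, 2), 3))
    print(p_last_is_new_bruteforce((2, 2, 2), 3))
```

For three colours with two balls each and $n=3$, both functions return $3/5$.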

Related Solutions

[Math] Multivariate Hypergeometric Distribution Questions

Let $(X_1, X_2, \ldots, X_k)$ be a random vector denoting the number of balls of each color in a sample of size $m = X_1 + X_2 + \cdots + X_k$. The event "we have at least one of each color" translates into each $X_i$ being positive. The probability we seek is thus $$ p = \mathbb{P}\left( X_1 >0 \land X_2>0 \land \ldots \land X_k>0\right) = \sum_{x_1 =1}^{m-1} \sum_{x_2 =1}^{m-1} \cdots \sum_{x_k=1}^{m-1} \frac{\binom{N_1}{x_1} \binom{N_2}{x_2} \cdots \binom{N_k}{x_k}}{\binom{N}{m}} \delta_{m,x_1+x_2+\cdots+x_k} $$ where $N = N_1 + N_2 + \cdots + N_k$ is the total number of balls in the urn. For example, for $k=5$, $N_i=10$ and $m=10$ the probability equals $$ \frac{30890625}{50108674} \approx 0.6165 $$
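The sum above can be evaluated directly for small cases. The following is a minimal sketch, assuming the counts are passed as a tuple; the function name `p_all_colours_direct` is hypothetical. It loops over all tuples $(x_1,\ldots,x_k)$ with every $x_i \geqslant 1$ and keeps those that sum to $m$, which plays the role of the Kronecker delta.

```python
from fractions import Fraction
from itertools import product
from math import comb, prod


def p_all_colours_direct(counts, m):
    """P(every colour appears in a sample of size m), summing the pmf term by term."""
    total_balls = sum(counts)
    favourable = 0
    # Each x_i runs over 1..min(N_i, m); larger values contribute nothing.
    for xs in product(*(range(1, min(c, m) + 1) for c in counts)):
        if sum(xs) == m:  # the Kronecker delta in the formula
            favourable += prod(comb(c, x) for c, x in zip(counts, xs))
    return Fraction(favourable, comb(total_balls, m))


if __name__ == "__main__":
    p = p_all_colours_direct((10,) * 5, 10)
    print(p, float(p))
```

For $k=5$, $N_i=10$ and $m=10$ this returns the fraction $30890625/50108674$ quoted above.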

In the case where the urn contains an equal number of balls of each color, i.e. $N_i = c$ for $ 1 \leqslant i \leqslant k$, we can use the inclusion-exclusion principle to find the answer. The complementary probability is $$\begin{eqnarray} 1 -p &=& \mathbb{P}\left(X_1 = 0 \lor X_2 = 0 \lor \ldots \lor X_k=0 \right) \\ &=& \sum_{1 \leqslant {i_1} \leqslant k} \mathbb{P}\left(X_{i_1}=0\right) - \sum_{1 \leqslant i_1 < i_2 \leqslant k} \mathbb{P}\left(X_{i_1}=0 \land X_{i_2}=0\right) \\ &\phantom{=}& + \sum_{1 \leqslant i_1 < i_2 < i_3 \leqslant k} \mathbb{P}\left(X_{i_1}=0 \land X_{i_2}=0 \land X_{i_3}=0\right) - \cdots \\ &\phantom{=}& - (-1)^k \mathbb{P}\left(X_1=0 \land X_2 = 0 \land \cdots \land X_k = 0\right) \end{eqnarray} $$ Due to exchangeability: $$ \mathbb{P}\left(X_{i_1}=0\right) = \mathbb{P}\left(X_1=0\right) = \frac{\binom{c}{0} \binom{(k-1)c}{m}}{\binom{k c}{m}} = \frac{\binom{(k-1)c}{m}}{\binom{k c}{m}} $$ $$ \mathbb{P}\left(X_1 = 0 \land X_2 = 0\right) = \frac{\binom{c}{0} \binom{c}{0} \binom{(k-2)c}{m}}{\binom{k c}{m}} = \frac{ \binom{(k-2)c}{m}}{\binom{k c}{m}} $$ and so on, with $\mathbb{P}\left(X_1 = X_2 = \ldots = X_s = 0\right) =\frac{ \binom{(k-s)c}{m}}{\binom{k c}{m}}$. Hence, given that $\sum_{1 \leqslant i_1 < i_2 < \ldots < i_s \leqslant k} 1 = \binom{k}{s}$, we arrive at the result: $$ 1- p = \sum_{s=1}^k (-1)^{s-1} \binom{k}{s} \frac{\binom{(k-s)c}{m}}{\binom{k c}{m}} $$ that is $$ p = \sum_{s=0}^k (-1)^s \binom{k}{s} \frac{\binom{(k-s) c}{m}}{\binom{k c}{m}} \stackrel{s \to k-s}{=} \sum_{s=0}^k (-1)^{k-s} \binom{k}{s} \frac{\binom{s c}{m}}{\binom{k c}{m}} $$
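The inclusion-exclusion result is short enough to evaluate exactly. The sketch below, with the hypothetical function name `p_all_colours_equal`, implements the first form of the final sum, $p = \sum_{s=0}^k (-1)^s \binom{k}{s} \binom{(k-s)c}{m} / \binom{kc}{m}$, using exact rational arithmetic.

```python
from fractions import Fraction
from math import comb


def p_all_colours_equal(k, c, m):
    """P(all k colours appear in m draws) when each colour has c balls."""
    # Alternating sum over the number s of colours forced to be absent.
    numerator = sum(
        (-1) ** s * comb(k, s) * comb((k - s) * c, m) for s in range(k + 1)
    )
    return Fraction(numerator, comb(k * c, m))


if __name__ == "__main__":
    p = p_all_colours_equal(5, 10, 10)
    print(p, float(p))
```

With $k=5$, $c=10$ and $m=10$ it reproduces $30890625/50108674$, the value obtained from the direct sum.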

[Math] Drawing balls from an urn with balls from 3 different colors

When drawing without replacement, the result is just

$$\frac{{5 \choose 1}\cdot{6 \choose 1}\cdot{8 \choose 1}}{19 \choose 3}\approx0.2477$$

The denominator $${19 \choose 3}=\frac{19!}{\color{blue}{3!}\cdot16!}$$ already accounts for the different orderings, so you don't need to multiply by $3!$.

Note that your result is greater than one and a probability must be between $0$ and $1$.

When working with replacement, this becomes a multinomial distribution. Letting $X_1$, $X_2$, and $X_3$ denote the number of red, blue, and green balls selected, respectively, we have

$$\mathsf P(X_1=X_2=X_3=1)=\frac{3!}{1!\cdot1!\cdot1!}\cdot\frac{5}{19}\cdot\frac{6}{19}\cdot\frac{8}{19}\approx0.2099$$

which agrees with your result.
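Both values can also be checked exactly rather than by simulation. The following is a small sketch using Python's standard library; the variable names are illustrative, and the colour counts (5 red, 6 blue, 8 green) are the ones used in the formulas above.

```python
from fractions import Fraction
from math import comb, factorial

red, blue, green = 5, 6, 8
total = red + blue + green  # 19 balls in the urn

# Without replacement: hypergeometric count of one ball of each colour.
p_without = Fraction(comb(red, 1) * comb(blue, 1) * comb(green, 1), comb(total, 3))

# With replacement: multinomial probability, 3! orderings of the three colours.
p_with = (
    factorial(3)
    * Fraction(red, total)
    * Fraction(blue, total)
    * Fraction(green, total)
)

if __name__ == "__main__":
    print(p_without, float(p_without))
    print(p_with, float(p_with))
```

The exact fractions are $80/323$ without replacement and $1440/6859$ with replacement, matching the rounded values $0.2477$ and $0.2099$.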

R Simulation Without Replacement:

> urn = c(rep("red",5),rep("blue",6),rep("green",8))
> # count distinct colours in each of 10^6 samples of size 3
> u = replicate(10^6, length(unique(sample(urn, 3, replace=FALSE))))
> mean(u == 3)
[1] 0.247378

R Simulation With Replacement:

> urn = c(rep("red",5),rep("blue",6),rep("green",8))
> # same experiment, but each ball is returned to the urn after it is drawn
> u = replicate(10^6, length(unique(sample(urn, 3, replace=TRUE))))
> mean(u == 3)
[1] 0.210267
