N black balls k white balls in M bins, what is the probability of selecting a black ball from any bin.

balls-in-binscombinatoricspolya-urn-modelprobability

There are $n$ black balls and $k$ white, with $M$ bins. The process for filling the bins is as follows:

  • Consider each of the $N=n+k$ balls, one at a time.
  • For each ball, select 1 of the $M$ bins uniformly at random.
  • Place the ball in the bin.

The $N$ balls are allocated into the $M$ bins. Now, from a bin select a single ball, if a bin contains multiple balls a single ball is chosen uniformly at random. I want to work out the probability that a black ball is selected from a particular bin?

Here is my approach:

Let, $X_r$ be the number of balls of type $r\in \{b,w\}$ in a bin let $y_r=n$ for black balls and $y_r=k$ for white balls. Then the probability it has $s$ balls of type $r$ is given by the binomial distribution $$P(X_r=s)=\binom{y_r}{s}\left(\frac{1}{M}\right)^s\left(1-\frac{1}{M} \right)^{N-s}.$$

By independence of the events the probability that a bin has $i$ white and $j$ black balls in it $(t=i+j)$, with $X=X_w+X_b$, is
$$P(X=t)=\binom{n}{i}\binom{k}{j}\left(\frac{1}{M}\right)^t\left(1-\frac{1}{M} \right)^{N-t}.$$
Let $P(b_t)$ be the probability of selecting a black ball given $t$ balls in the bin. Since, the selection is uniformly at random it should just be the product of the fraction of black balls and the likelihood of t balls being in the bin
$$P(b_{t=i+j})=\frac{j}{i+j}\binom{n}{i}\binom{k}{j}\left(\frac{1}{M}\right)^t\left(1-\frac{1}{M} \right)^{N-t}.$$
Now, the total probability of choosing a black ball is given by
$$ \sum_{i=0}^k \sum_{j=1}^n \frac{j}{i+j}\binom{n}{i}\binom{k}{j}\left(\frac{1}{M}\right)^t\left(1-\frac{1}{M} \right)^{N-t} \tag{*}\label{*}$$
where index $j$ starts at 1, since if no black balls are in the bin the probability of selecting one is zero and it ensures we don't get the undefined $\frac{0}{0}$ when $i=0$.

I think that $\eqref{*}$ is correct. My question is what ways can $\eqref{*}$ be simplified? I dont think I can use Vandermonde's Identity as the term $\left(\frac{1}{M}\right)^i$ is not able to be taken out of the sum over $i$ for example.

I have put it into mathematica and get the following

 Sum[(s/(s + t))*Binomial[nh, s]*Binomial[nl, t]*(1/M)^(s + t)*(1 - 1/M)^(nh + nl - s - t), 
    {s, 1, nh}, {t, 0, nl}]


ProbHireH[M, nh, nl]=(1 - 1/M)^(m + n) (1/((1 - 1/M) M))^n 
DifferenceRoot[Function[{\[FormalY], \[FormalN]}, {(-1 + M) (-1 - \[FormalN] + 
          n) (-\[FormalN] + m + n) \[FormalY][\[FormalN]] + (2 + 
          5 \[FormalN] + 3 \[FormalN]^2 - 2 m - 2 \[FormalN] m - M - 
          3 \[FormalN] M - 2 \[FormalN]^2 M + m M + \[FormalN] m M - 
          3 n - 4 \[FormalN] n + m n + 2 M n + 3 \[FormalN] M n - 
          m M n + n^2 - M n^2) \[FormalY][
         1 + \[FormalN]] - (1 + \[FormalN]) (4 + 3 \[FormalN] - m - 
          M - \[FormalN] M - 2 n + M n) \[FormalY][
         2 + \[FormalN]] + (1 + \[FormalN]) (2 + \[FormalN]) \
\[FormalY][3 + \[FormalN]] == 0, \[FormalY][0] == 0, \[FormalY][1] == 
      Hypergeometric2F1[-m, n, 1 + n, -(1/(-1 + M))], \[FormalY][
       2] == (1 - 1/M) M n Hypergeometric2F1[-m, -1 + n, 
         n, -(1/(-1 + M))] + 
       Hypergeometric2F1[-m, n, 1 + n, -(1/(-1 + M))]}]][n]

Now, while this is a simplification of sorts, I am unsure that this is the simplest respresentation. This is because with the independence of everything I think that the probability of choosing a black ball from a bin after the allocation, could be the product of the probability that a bin has at least one ball and the fraction of black balls overall
$$ \left(1- \left(1-\frac{1}{M}\right)^{n+k}\right)\frac{n}{n+k}. \tag{**}\label{**}$$

I can veryify (although not exhaustively) that $\eqref{*}$ and $\eqref{**}$ yield the same probabilities, for example

SimpleProbHireH[M_, nh_, nl_] := (1 - (1 - 1/M)^(nh + nl))*nh/(nh + nl)

SimpleProbHireH[10, 5, 5]=6513215599/20000000000 

ProbHireH[M_, nh_, nl_] := 
 Sum[(s/(s + t))*Binomial[nh, s]*Binomial[nl, t]*(1/M)^(s + t)*(1 - 1/M)^(nh + nl - s - t), {s, 1, nh}, {t, 0, nl}]


ProbHireH[10, 5, 5]=6513215599/20000000000

Any help with this would be much appreciated.

Best Answer

Now, from a bin select a single ball, if a bin contains multiple balls a single ball is chosen uniformly at random.

And what happens if the bin you choose is empty?

The basis of your calculation is the implicit assumption that in that case you don't select any ball at all. A perfectly reasonable assumption, by the way. It means we really have three possible events we could consider:

\begin{align} B &= \text{You select a black ball.}\\ W &= \text{You select a white ball.}\\ E &= \text{You select a no ball.}\\ \end{align}

The probability of event $E$ is easy to figure out. It is the probability that the bin is empty:

$$ \mathbb P(E) = \left(1-\frac{1}{M}\right)^{n+k}. $$

But suppose the bin is not empty. Given that the bin is not empty, which occurs with probability $1 - \mathbb P(E),$ there are $n+k$ balls in the problem, any one of which could first have been placed in the bin and then selected from it.

Label the black balls $1,\ldots,n$ and the white balls $1,\ldots,k$; let $B_j$ be the event of drawing black ball $j$ and let $W_i$ be the event of drawing white ball $i.$ Given that the bin is not empty, no ball has a better chance than any other ball of being the ball eventually selected from the chosen bin, so $\mathbb P(B_j \mid E^\complement) = \mathbb P(W_i \mid E^\complement) = \mathbb P(B_1 \mid E^\complement)$ for all $i$ and $j$. Since the events $B_j$ and $W_i$ are disjoint and one of them must happen given that the bin is not empty, by the law of total probability, $$ 1 = \sum_{j=1}^n \mathbb P(B_j \mid E^\complement) + \sum_{i=1}^k \mathbb P(W_i \mid E^\complement) = (n + k) \mathbb P(B_1 \mid E^\complement), $$ that is, each ball has conditional probability $1/(n+k)$ of being selected.

But there are $n$ black balls, so $$ \mathbb P(B \mid E^\complement) = \frac n{n+k}. $$

Therefore \begin{align} \mathbb P(B) &= \mathbb P(B \mid E^\complement) \mathbb P(E^\complement) \\ &= \mathbb P(B \mid E^\complement) (1 - \mathbb P(E)) \\ &= \frac n{n+k} \left(1 - \left(1-\frac1M\right)^{n+k}\right). \end{align}


Now it would be nice to know how to get this from the expression you put together earlier. But first that expression has to be correct. I agree with your formula for $\mathbb P(X_r = s),$ but a formula for $\mathbb P(X=t)$ would have to account for all the different ways you might have $t$ balls in the bin, which your formulas never explicitly do (and it's just as well that they don't).

What you actually seem to be trying to show in your next formula is the probability of $i$ white balls and $j$ black balls, $\mathbb P(X_w=i,X_b=j).$ Moreover, you mistakenly swapped $n$ and $k$ in the binomial coefficients: it's $i$ white balls out of $k$ and $j$ black balls out of $n$, so the formula should be

$$ \mathbb P(X_w=i,X_b=j) = \binom ki \binom nj \left(\frac1M \right)^{i+j} \left(1 - \frac1M\right)^{k-i+n-j}. $$

I have written this without the symbols $N$ and $t$ since I don't think they're actually helpful at this point.

The next formula you wanted to show was not the probability that you choose a black ball given that there are $i$ white balls and $j$ black balls (which would just be $j/(i+j)$), it was the probability that you choose a black ball and there are $i$ white balls and $j$ black balls:

$$ \mathbb P(B,X_w=i,X_b=j) = \frac{j}{i+j} \binom ki \binom nj \left(\frac1M \right)^{i+j} \left(1 - \frac1M\right)^{k-i+n-j}. $$

So now formula (*), correctly written, is $$ \mathbb P(B) = \sum_{i=0}^k \sum_{j=1}^n \frac{j}{i+j} \binom ki \binom nj \left(\frac1M \right)^{i+j} \left(1 - \frac1M\right)^{k-i+n-j}. \tag1 $$

Now what I'd like to do is to rearrange the order of summation so the outer variable is the total number of balls, $t = i+j.$ To cover terms of the sum in Equation $(1)$ we need $t$ to range from at least $1$ to $n+k.$ If we take the convention that $\binom xy = 0$ when $y > x,$ and also set $p = \frac1M$ and $q = 1 - \frac1M,$ the sum can be written this way: $$ \mathbb P(B) = \sum_{t=1}^{n+k} \sum_{j=1}^t \frac jt \binom k{t-j} \binom nj p^t q^{n+k-t}. \tag2 $$ The sum in Equation $(2)$ generally has more terms than the sum in Equation $(1),$ but the extra terms all occur for the cases $j > n$ and $t - j > k,$ and in these cases either $\binom nj = 0$ or $\binom k{t-j} = 0,$ so the extra terms conveniently vanish. (Alternatively, we could deal with the $j > n$ case by setting the inner sum's upper bound at $j = \max\{n,t\}$ and deal with the $t - j > k$ case by applying other jiggery-pokery to the lower bound, but the zero terms let us keep the bounds simple.)

Since the powers of $p$ and $q$ do not depend on $j,$ we can factor them out of the inner sum, along with the factor $\frac1t$: $$ \mathbb P(B) = \sum_{t=1}^{n+k} p^t q^{n+k-t} \cdot \frac1t \sum_{j=1}^t j \binom k{t-j} \binom nj . $$

Now to simplify the inner sum, suppose we randomly select $t$ balls from a collection of $n$ black balls and $k$ white balls and let $Y$ be the number of black balls chosen. The expected number of black balls is $$ \mathbb E(Y) = \sum_{j=0}^t j\, \mathbb P(Y = j) = \sum_{j=0}^t j \frac{\binom k{t-j} \binom nj}{\binom{n+k}t}, $$ again using the convention $\binom xy = 0$ when $y > x$ to deal with the terms where either $j$ or $t-j$ is too large.

But setting $Y_m=1$ if the $m$th ball drawn is black, $Y_m=0$ otherwise, $\mathbb E(Y_m) = \frac{n}{n+k}$ for $1 \leq m \leq t$ and therefore by linearity of expectation, $$ \mathbb E(Y) = \sum_{m=1}^t \mathbb E(Y_m) = \frac{tn}{n+k}. $$ Setting the far right sides of both equations for $\mathbb E(Y)$ equal and multiplying both by $\frac1t \binom{n+k}t,$ we conclude that $$ \frac1t \sum_{j=0}^t j \binom k{t-j} \binom nj = \frac{n}{n+k} \binom{n+k}t. $$

We can therefore further simplify Equation $(2)$ as follows: \begin{align} \mathbb P(B) &= \sum_{t=1}^{n+k} p^t q^{n+k-t} \frac{n}{n+k} \binom{n+k}t \\ &= \frac{n}{n+k} \sum_{t=1}^{n+k} \binom{n+k}t p^t q^{n+k-t} \\ &= \frac{n}{n+k} \left( \left(\sum_{t=0}^{n+k} \binom{n+k}t p^t q^{n+k-t}\right) - q^{n+k} \right)\\ &= \frac{n}{n+k} \left( (p + q)^{n+k} - q^{n+k} \right)\\ &= \frac{n}{n+k} \left(1 - q^{n+k} \right)\\ \end{align}

In short, by adding up all the terms generated by formula (*), we conclude that $$ \mathbb P(B) = \frac n{n+k} \left(1 - \left(1-\frac1M\right)^{n+k}\right), $$ in agreement with the argument in the first part of this answer.