N*m distinguishable balls with m different colors, the probability of randomly choosing k balls containing all balls from at least 2 different colors

balls-in-binsbinomial distributioncombinationscombinatoricsprobability

Assume we have m groups of n balls and the balls in the same group have the same color. So there are m*n balls in total. Now, suppose we randomly choose k>(2*n) balls from the set of m*n balls. How much is the probability that the chosen k balls contain all balls of at least two different colors (entirely all balls of two groups)?

In other words, the unchosen set of balls contains the balls of different m-2 colors at most (instead of m colors).

To grasp better, notice the picture of 3*4 ( = n*m) balls. Each group of 3 balls has the same color. The probability I am looking for is to choose k balls containing the balls of two entire groups. For example, choosing balls 1, 5, 9, 3, 7, 11, 8 (contains all yellow and blue balls).

4 groups of 3 balls with the same colors: n = 3, m = 4

I hope I could explain the problem clearly. I have implemented a simulator for testing different scenarios. Then I tested the simulated results with different combinatorial/binomial solutions. But I get different results every time and now I'm lost.

This is my simulator in python testing different choices many times:

from random import sample
from collections import Counter
m = 4
n = 3
k = 5
it = 100000
balls = range(m*n)
cf = 0
for i in range(it):
    choices = sample(balls, k)
    samecolors = map(lambda x:x%m, choices)
    cnt = Counter(samecolors)
    mc = cnt.most_common(2)
    if (mc[-1][-1] == n): // if the second most common chosen color has *n* balls
        cf += 1
print(float(cf)/float(it))

Best Answer

The overall number of ways to choose $k$ out of $mn$ balls is $\binom{mn}{k}$, where all balls are assumed to be distinguishable. Among them naively there are $$\binom mr\binom{mn-rn}{k-rn}$$ combinations consisting of at least $r$ full sets of balls of the same color. However if $k\ge(r+1)n$ the above expression will double-count all combinations consisting of more than $r$ full sets, which should be accounted for. The correct way for this is the generalised inclusion-exclusion principle: $$ \nu_r=\sum_{j\ge r}(-1)^{j-r}\binom jr\binom mj\binom{mn-jn}{k-jn}, $$ which gives the number of combinations with exactly $r$ full sets.

To obtain the number of combinations with at least $r$ full sets one should sum the above expressions: $$\begin{align} N_r=\sum_{i\ge r}\nu_i&=\sum_{i\ge r}\sum_{j\ge i}(-1)^{j-i}\binom ji\binom mj\binom{mn-jn}{k-jn}\\ &=\sum_{j\ge i}(-1)^j\binom mj\binom{mn-jn}{k-jn}\sum_{i\ge r}(-1)^{i}\binom ji\\ &=\sum_{j\ge r}(-1)^{j-r}\binom{j-1}{r-1}\binom mj\binom{mn-jn}{k-jn}. \end{align}$$

Thus, the probability in question reads (with $r=2$): $$ p_r=\frac {\sum_{j\ge r}(-1)^{j-r}\binom{j-1}{r-1}\binom mj\binom{mn-jn}{k-jn}}{\binom{mn}k}. $$

Related Question