Throwing n balls into m boxes, probability that all boxes have more than q balls

balls-in-binsbinomial-coefficientscombinatorics

Assuming we throw $n$ balls into $m$ boxes, I want to calculate the probability (not just come up with the formula) that all boxes have each more than $q$ balls in it (for small $q$).

$$P(\text{all boxes:} > q \text{ balls}) = 1 – P(\text{at least one box:} \leq q \text{ balls})$$

Now I wanted to use the inclusion exclusion principle and counted the possible combinations. As a little explanation for the formulas below: when counting the number of combinations of 2 boxes having each $\leq q$ balls, both boxes A and B can have each $a, b \in [0,q]$ balls in them. $n \choose a+b$ possibilities to pick the balls that go into A and B and $a+b \choose a$ possibilities which balls of those go into A and which go into B.

$$|\text{1 box:} \leq q \text{ balls}| = \sum_{i=0}^{q} {n \choose i} (m-1)^{n-i}$$

$$|\text{2 boxes:} \leq q \text{ balls}| = \sum_{i=0}^{q}\sum_{j=0}^{q} {n \choose i + j} {i + j \choose i} (m-2)^{n-i-j}$$

$$|\text{3 boxes:} \leq q | = \sum_{i=0}^{q}\sum_{j=0}^{q}\sum_{k=0}^{q} {n \choose i + j + k} {i + j+k \choose i+j} {i + j \choose i} (m-3)^{n-i-j-k}$$

$$…$$

But as soon as I have a few dozen/hundred boxes, I have no chance of computing the actual probability. I tried to come up with different approaches to count the combinations (e.g., distributing $2q$ balls (with ${n \choose 2k}$ possibilities of choosing them) across $n$ boxes and then distributing the remaining $m-2k$ boxes), but I ended up counting combinations twice and I have a feeling that this approach does not work. Is there another way of counting for this problem, that doesn't have this complexity in $m$?

Best Answer

Welcome to MSE!

Given the discussion in the comments, this post has been heavily edited to explain a rather silly mistake I made but hopefully also a bunch of new enumerative combinatorial ideas!

The idea is that often in combinatorics we want to calculate the number of ways of partitioning a set of objects into various piles, but there are in fact myriad ways of interpreting this statement. In fact, the various descriptions are so subtly different that instead the different counting notions are often rephrased in terms of counting various kinds of functions between a set $M = \{1,2,..., m\}$ and $N= \{1,2,...,n\}$.

We can first introduce some notation while simultaneously explaining the differences between some of the counting ideas. A partition of a bunch of objects is some way of splitting it into various piles so that, to be clear, each pile has at least one element. A placement (not standard terminology, but important for this answer) is a way of splitting up a collection of objects among what we can imagine to be a specified collection of jars, but we allow some of these jars to remain empty in such a division.

If $m$ jars are unlabeled (and thus indistinguishable) and the same goes for $n$ balls then the number of partitions of the balls among the jars is notated as $p_m(n)$. More generally, the number of partitions of $n$ indistinguishable objects into subsets (which are also indistinguishable objects) is called the $n$th partition number and is notated $p(n)$. As with a number of such enumerations, there is no simply stated formula known for either of these but clearly $$p(n) = \sum_{k=1}^np_k(n)$$ and we also have $$p_k(n)=p_k(n-k)+p_{k-1}(n-1)$$ Can you see why? Another particularly striking recursion relation is given by Euler's Pentagonal Number Theorem. We also know, thanks to Ramanujan and Hardy, the famous asymptotic equivalence $$p(n) \sim \frac{1}{4n\sqrt{3}}\cdot\exp\left(\pi\sqrt{\frac{2n}{3}}\right) $$ which at least tells you the rough order of growth!

Now suppose you instead want to consider the number of partitions of n balls that are in fact numbered from $1,...,n$ (and thus all distinguishable from each other) into $m$ indistinguishable jars: this is called a Stirling number of the Second Kind and is rather neatly denoted $$\begin{Bmatrix} n\\ m \end{Bmatrix}$$ where you can use the curly brackets to make you think of the fact that the piles are indistinguishable, i.e. they are like subsets of the numbers from $1,...,n$. Much again can (and is) written about these - the total number of ways of partitioning $n$ distinguishable balls into various indistinguishable piles is called the $n$th Bell Number.

If the balls we are partitioning are instead indistinguishable but the jars we put them into are instead labeled, then a stars and bars argument in fact tells you there are $\binom{n-1}{m-1}$ ways of doing this.

Finally, if both the balls and piles are labeled and distinguishable then we find there are $$m! \begin{Bmatrix} n\\ m\end{Bmatrix}$$ ways. This is because there are precisely $m!$ ways of converting each partition where the balls are labeled but the jars not into one where the jars are labeled (i.e. $m!$ ways of writing the numbers $1,...,m$ each exactly once on $m$ jars).

Building on this notation we can discuss the things more relevant to the probability. There are four possibilities:

  1. Suppose we want to know the number of ways of partitioning $n$ unordered balls into $m$ piles, where each pile contains at least $q$ members. This can be written $p^q_m(n)$ and a very nice argument here explains that it is equal to $p_m(n-m(q-1))$. We may also work out the total number of possible placings of $n$ unordered balls into $m$ unlabeled jars: this is just the same as the number of ways of partitioning the $n$ balls into $m$ or fewer piles (when we consider fewer than $m$, we're accounting for the "empty piles") and so it is $$\sum_{k=1}^m p_m(n) = p_m(n+m)$$

  2. The number of ways of partitioning $n$ ordered balls (and thus distinguishable) into $m$ unlabeled jars such that each contains at least $q$ members: this is precisely the definition of the $q$-associated Stirling number of the Second Kind $S_q(n,m)$ as discussed in my original answer. The total number of placements of $n$ ordered balls into unlabeled piles is calculated similar to in 1. above: it is the same as the number of ways of partitioning the $n$ balls into $m$ or fewer piles and so it is $$\sum_{k=1}^m \begin{Bmatrix} n\\ m\end{Bmatrix}$$

  3. If the $n$ balls are unlabeled but the jars are, then another stars and bars argument can tell you that the number of partitions where each pile has at least $q$ members is $$\binom{n-m(q-1)-1}{m-1}$$ while the number of placements of $n$ unlabeled balls into labeled piles is $$\binom{m+n-1}{n}$$

  4. Finally if the balls and the piles are labeled, the total number of partitions where each pile has at least $q$ members is simply $m!S_q(n,m)$ by a similar argument to one we made above about the general partitions of this kind: each partition counted by $S_q(n,m)$, where the balls are ordered but the jars not, can be made into one where the jars are ordered in $m!$ ways! How many total placements are possible with $n$ labeled balls into labeled piles/jars? This is really simple: $m^n$ (by the obvious argument).

That was a lot to digest: perhaps make a few tables! Remember, there are three kinds of piles we've looked at: ones where piles have to be nonempty (partition), ones where each pile has to contain at least $q$ things and ones where the piles can be empty (placement). There are then two categories we split by: whether or not the balls were labeled and whether not the piles were labeled.

Which of these types of counting are relevant to our problem at hand? We wanted to calculate the probability that, when we throw $n$ balls into $m$ boxes, all boxes have each more than $q$ balls in them. We may work out this probability by imagining what we are describing as a process: we want to think of the number of ways of throwing $n$ balls, one by one in a sequence, into $m$ boxes such that they all end up with at least $q$ and divide this by the total number of such sequences, i.e. here the balls are labeled (as we place them in via a sequence of steps) and the boxes are labeled (we imagine them to be in a line). Notice that the boxes' labels are really an important part of the situation: to pick a random box at each stage of the process, we need to have a fixed labeling of the boxes with the integers from 1 to $m$ and then choose a random integer in that range.

Thus the probability we seek is $\frac{m!S_q(n,m)}{m^n}$. I initially wrote $p = \frac{S_q(n,m)}{m^n}$, so the denominator counted the number of partitions of $n$ ordered balls into $m$ unordered boxes. It's an easy mistake to make because you can say to yourself, "In the end, the outcome asks whether or not there are at least $q$ balls in each box. This statement is true or false for a given run regardless of whether or not the boxes have labels on them and so the boxes may as well be label-less.". A similar argument can be made to seemingly suggest the balls don't need to be labeled in our counting either. The fractions between such countings, however, do not represent the numbers of ways of things happening in this probabilistic process - this is a great example of need to define your events and sample space carefully.


Food for thought:

In the above, when we switched from ordered balls, unordered boxes to ordered balls, ordered boxes we saw that $S_q(n,m)$ changes to $m!S_q(n,m)$ and similarly $\begin{Bmatrix} n\\ m\end{Bmatrix}$ changes to $m!\begin{Bmatrix} n\\ m\end{Bmatrix}$ (when considering the number of partitions where each pile has at least $q$ members and the general number of partitions into $m$ boxes respectively). Why didn't $\sum_{k=1}^m \begin{Bmatrix} n\\ k\end{Bmatrix}$ become $m!\sum_{k=1}^m \begin{Bmatrix} n\\ k\end{Bmatrix}$ and instead become $m^n$? Can you see why/how this relates to the fact that $$m^n = \sum_{k=1}^m \begin{Bmatrix} n\\ k\end{Bmatrix}\cdot \frac{m!}{(m-k)!}$$?

If you thought this wasn't confusing enough and wanted to learn more interesting facts about things that are incredibly subtly different, there are a number of other important counting sequences such as the Catalan numbers, ordered Bell numbers, Lah numbers and much much more.

Related Question