[Math] Second pair of matching birthdays

balls-in-binsprobability

The "birthday problem" is well-known and well-studied. There are many versions of it and many questions one might ask. For example, "how many people do we need in a room to obtain at least a 50% chance that some pair shares a birthday?" (Answer: 23)

Another is this: "Given $M$ bins, what is the expected number of balls I must toss uniformly at random into bins before some bin will contain 2 balls?" (Answer: $\sqrt{M \pi/2} +2/3$)

Here is my question: what is the expected number of balls I must toss into $M$ bins to get two collisions? More precisely, how many expected balls must I toss to obtain the event "ball lands in occupied bin" twice?

I need an answer for very large $M$, so solutions including summations are not helpful.


Silly Observation:

The birthday problem predicts we need about 25 US Presidents for them to share a birthday. It actually took 28 presidents to happen (Harding and Polk were both born on Nov 2). We see from the answers below that after about 37 US Presidents we should have a 2nd collision. However Obama is the 43rd and it still hasn't happened (nor would it have happened if McCain had won or Romney had won; nor will it happen if H. Clinton wins in 2016).

Best Answer

I'll first write down the exact answer (summation included), and then various approximations.

Let $X$ be the random variable denoting the number of balls after which the second collision occurs. We want $E[X]$. As $X$ takes only non-negative integer values, its expectation is $$E[X] = \sum_{n=1}^{\infty}\Pr(X \ge n) = \sum_{n=0}^{\infty}\Pr(X > n)$$

The probability $\Pr(X > n)$ is the probability that after $n$ balls are thrown into the $m$ bins, the number of collisions is either $0$ or $1$. The probability of $0$ collisions is, as in the classical birthday problem, $\dfrac{m(m-1)\dots(m-n+1)}{m^n}$. The probability of exactly one collision can be calculated as follows: for this to happen, some pair of balls must go to the same bin, after which the other $n-2$ balls must go to the other $m-1$ bins without collision, so the probability of this is $\displaystyle \binom{n}{2} m \frac{1}{m^2} \frac{(m-1)(m-2)\dots(m-n+2)}{m^{n-2}} = \binom{n}{2} \frac{m(m-1)(m-2)\dots(m-n+2)}{m^n}$. Note that this is precisely $\binom{n}{2} \frac{1}{m-n+1}$ times the probability of $0$ collisions.

For the classical birthday problem, the same analysis as above gave that the expected number of balls until collision (call this random variable $Y$) is

$$ \begin{align} E[Y] &= \sum_{n=0}^{\infty} \frac{m(m-1)\dots(m-n+1)}{m^n} \\ &= 1 + 1 + \frac{m-1}{m} + \frac{(m-1)(m-2)}{m^2} + \frac{(m-1)(m-2)}{m^3} + \dots \end{align}$$

The asymptotic expansion of this sum, related to Ramanujan's Q-function, is known to be $$\sqrt{\frac{\pi m}{2}} + \frac{2}{3} + \frac{1}{12}\sqrt{\frac{\pi}{2m}} - \frac{4}{135m} + \dots$$

In our case, the expected number of balls until two collisions is the old sum, plus additionally the sum of the terms coming from the probabilities of having exactly one collision: it is $$ \begin{align} E[X] &= \sum_{n=0}^{\infty} \left( \frac{m(m-1)\dots(m-n+1)}{m^n} + \binom{n}{2} \frac{m(m-1)(m-2)\dots(m-n+2)}{m^n} \right) \\ &= 1 + 1 + \frac{m-1}{m} + \frac{(m-1)(m-2)}{m^2} + \frac{(m-1)(m-2)(m-3)}{m^3} + \dots\\ &\phantom{=1+1} + \frac{1}{m} + 3\frac{(m-1)}{m^2} + 6\frac{(m-1)(m-2)}{m^3} + \dots \end{align} $$

That's the exact expression; now for the approximations.


One trivial fact we can prove rigorously is that $E[Y] < E[X] < 2E[Y]$. The former inequality is from the fact that the second collision can only happen after the first one, so $X \ge Y$. The second inequality is from the fact that after the first collision if we clear all the bins and wait for a collision to happen again from scratch, the second collision can only take longer than in the normal case. So $$\sqrt{\frac{\pi m}{2}} < E[X] < \sqrt{2\pi m}$$


One avenue of approximation is to instead calculate the "median" $X$, as Ross Millikan's answer does: find the number of balls $n$ for which the probability of $0$ or $1$ collisions is $\frac{1}{2}$. The expected value of $X$ will be around this. (And also strictly more than half of this $n$, because we have $E[X] \ge n \Pr(X \ge n) = \frac{n}{2}$.)

His answer already gives one way, here is another. From the expression we calculated above for the probability, we want $n$ such that $$ \begin{align} \frac12 &= \frac{m(m-1)\dots(m-n+1)}{m^n} + \binom{n}{2} \frac{m(m-1)(m-2)\dots(m-n+2)}{m^n} \\ &= \left(\frac{m-n+1}{m} + \binom{n}{2}\frac{1}{m} \right) \frac{m(m-1)(m-2)\dots(m-n+2)}{m^{n-1}} \\ &= \left(\frac{m-n+1}{m} + \binom{n}{2}\frac{1}{m} \right) \left(1 - \frac1m \right) \left(1 - \frac2m \right) \cdot \dots \cdot \left(1 - \frac{n-2}m \right) \\ &\approx \left(1 + \frac{n^2}{2m} \right) e^{-1/m} e^{-2/m} e^{-3/m} \dots e^{-(n-2)/m} \\ &\approx \left(1 + \frac{n^2}{2m} \right) e^{-n^2/2m} \\ \end{align}$$ which gives exactly the same approximation $\frac{n^2}{2m} \approx 1.678$ as in Ross Millikan's answer! Specifically, it gives $E[X] \approx 1.83\sqrt{m}$.


Another approximation is to use $E[X] = E[E[X|Y]]$ and assume that $Y$ (the time of the first collision) is concentrated about its mean. After the first collision happens (after $Y$ balls), there are $Y-1$ nonempty bins, so about a fraction $\frac{Y}{m}$ of the bins are nonempty and a fraction $1 - \frac{Y}{m}$ of the bins are empty. The second collision happens when either a ball lands into one of the currently occupied bins, or a ball lands for the second time into one of the bins not yet occupied. The former is much more likely to happen first than the latter. The expected time that the former takes is $\dfrac1{Y/m}$, as it's a geometric distribution (like tossing a coin that has probability $\frac{Y}{m}$ of heads, until heads comes up). So $$E[X] \approx E\left[Y + \frac{m}{Y}\right] $$ which if $Y$ is strongly concentrated around $\sqrt{\frac{\pi m}{2}}$, will be $$E[X] \approx \sqrt{\frac{\pi m}{2}} + \sqrt{\frac{2m}{\pi}} = \sqrt{m}\left( \sqrt{\frac{\pi}2} + \sqrt{\frac2\pi} \right)$$ which ($E[X] \approx 2.05\sqrt{m}$) is slightly more than the answer given by the first approximation.


We know that $E[X]$ is between $\sqrt{\frac{\pi}{2}}\sqrt{m}$ and $\sqrt{2\pi}\sqrt{m}$, so we know for sure that asymptotically, $E[X] \sim c\sqrt{m}$ for some constant $c$. We just need to find the constant $c$, and it seems the best way is to write a program for it. Here's a Python program that uses the exact formula above to calculate the value of $E[X]$ for increasing $m$, and prints $\frac{E[X]}{\sqrt{m}}$:

import math

def x(m):
    ans = 2
    term = 1
    other_term = 1.0 / m
    for n in range(2, m + 2):
        term *= (m - (n - 1.0))  / m
        other_term *= (m - (n - 2.0)) / m
        ans += term + (n * (n - 1) * other_term) / 2
    return ans

m = 1
while True:
    m *= 2
    n = x(m)
    print m, n, n / math.sqrt(m)

which prints (among other things):

...
16777216 7701.36209791 1.88021535594
33554432 10890.956487 1.88014384413
67108864 15401.7241385 1.88009327862
134217728 21780.9129334 1.88005752389
...

so $c \approx 1.88$ or $E[X] \approx 1.88\sqrt{m}$ seems to be the true statement.


We can say more about birthdays specifically ($m = 365$). With a minor change to the above program, for each $n$, we can calculate exactly the probability that $X = n$ (i.e., that it takes $n$ people until the second collision happens). This turns out to give the following distribution.

 n       P(X > n)      P(X = n)
 25      0.810743      0.023477
 26      0.785794      0.024950
 27      0.759490      0.026304
 28      0.731969      0.027521
 29      0.703384      0.028585
 30      0.673899      0.029485
 31      0.643690      0.030209
 32      0.612939      0.030752
 33      0.581830      0.031109
 34      0.550548      0.031281
 35      0.519278      0.031270
 36      0.488198      0.031081
 37      0.457476      0.030721
 38      0.427275      0.030202
 39      0.397741      0.029533
 40      0.369011      0.028730
 41      0.341204      0.027807
 42      0.314426      0.026779
 43      0.288763      0.025662
 44      0.264290      0.024473
 45      0.241061      0.023229
 46      0.219117      0.021944
 47      0.198482      0.020635
 48      0.179168      0.019315
 49      0.161170      0.017997
 50      0.144476      0.016695
 51      0.129058      0.015418
 52      0.114881      0.014177
 53      0.101902      0.012979
 54      0.090072      0.011830
 55      0.079334      0.010738

So as $P(X > 43)$ is still a very healthy $0.288763$, it's nothing to be very surprised about that there is still only one collision among the 43 US presidents so far. :-)