The "birthday problem" is well-known and well-studied. There are many versions of it and many questions one might ask. For example, "how many people do we need in a room to obtain at least a 50% chance that some pair shares a birthday?" (Answer: 23)
Another is this: "Given $M$ bins, what is the expected number of balls I must toss uniformly at random into bins before some bin will contain 2 balls?" (Answer: $\sqrt{M \pi/2} +2/3$)
Here is my question: what is the expected number of balls I must toss into $M$ bins to get two collisions? More precisely, how many expected balls must I toss to obtain the event "ball lands in occupied bin" twice?
I need an answer for very large $M$, so solutions including summations are not helpful.
Silly Observation:
The birthday problem predicts we need about 25 US Presidents for them to share a birthday. It actually took 28 presidents to happen (Harding and Polk were both born on Nov 2). We see from the answers below that after about 37 US Presidents we should have a 2nd collision. However Obama is the 43rd and it still hasn't happened (nor would it have happened if McCain had won or Romney had won; nor will it happen if H. Clinton wins in 2016).
Best Answer
I'll first write down the exact answer (summation included), and then various approximations.
Let $X$ be the random variable denoting the number of balls after which the second collision occurs. We want $E[X]$. As $X$ takes only non-negative integer values, its expectation is $$E[X] = \sum_{n=1}^{\infty}\Pr(X \ge n) = \sum_{n=0}^{\infty}\Pr(X > n)$$
The probability $\Pr(X > n)$ is the probability that after $n$ balls are thrown into the $m$ bins, the number of collisions is either $0$ or $1$. The probability of $0$ collisions is, as in the classical birthday problem, $\dfrac{m(m-1)\dots(m-n+1)}{m^n}$. The probability of exactly one collision can be calculated as follows: for this to happen, some pair of balls must go to the same bin, after which the other $n-2$ balls must go to the other $m-1$ bins without collision, so the probability of this is $\displaystyle \binom{n}{2} m \frac{1}{m^2} \frac{(m-1)(m-2)\dots(m-n+2)}{m^{n-2}} = \binom{n}{2} \frac{m(m-1)(m-2)\dots(m-n+2)}{m^n}$. Note that this is precisely $\binom{n}{2} \frac{1}{m-n+1}$ times the probability of $0$ collisions.
For the classical birthday problem, the same analysis as above gave that the expected number of balls until collision (call this random variable $Y$) is
$$ \begin{align} E[Y] &= \sum_{n=0}^{\infty} \frac{m(m-1)\dots(m-n+1)}{m^n} \\ &= 1 + 1 + \frac{m-1}{m} + \frac{(m-1)(m-2)}{m^2} + \frac{(m-1)(m-2)}{m^3} + \dots \end{align}$$
The asymptotic expansion of this sum, related to Ramanujan's Q-function, is known to be $$\sqrt{\frac{\pi m}{2}} + \frac{2}{3} + \frac{1}{12}\sqrt{\frac{\pi}{2m}} - \frac{4}{135m} + \dots$$
In our case, the expected number of balls until two collisions is the old sum, plus additionally the sum of the terms coming from the probabilities of having exactly one collision: it is $$ \begin{align} E[X] &= \sum_{n=0}^{\infty} \left( \frac{m(m-1)\dots(m-n+1)}{m^n} + \binom{n}{2} \frac{m(m-1)(m-2)\dots(m-n+2)}{m^n} \right) \\ &= 1 + 1 + \frac{m-1}{m} + \frac{(m-1)(m-2)}{m^2} + \frac{(m-1)(m-2)(m-3)}{m^3} + \dots\\ &\phantom{=1+1} + \frac{1}{m} + 3\frac{(m-1)}{m^2} + 6\frac{(m-1)(m-2)}{m^3} + \dots \end{align} $$
That's the exact expression; now for the approximations.
One trivial fact we can prove rigorously is that $E[Y] < E[X] < 2E[Y]$. The former inequality is from the fact that the second collision can only happen after the first one, so $X \ge Y$. The second inequality is from the fact that after the first collision if we clear all the bins and wait for a collision to happen again from scratch, the second collision can only take longer than in the normal case. So $$\sqrt{\frac{\pi m}{2}} < E[X] < \sqrt{2\pi m}$$
One avenue of approximation is to instead calculate the "median" $X$, as Ross Millikan's answer does: find the number of balls $n$ for which the probability of $0$ or $1$ collisions is $\frac{1}{2}$. The expected value of $X$ will be around this. (And also strictly more than half of this $n$, because we have $E[X] \ge n \Pr(X \ge n) = \frac{n}{2}$.)
His answer already gives one way, here is another. From the expression we calculated above for the probability, we want $n$ such that $$ \begin{align} \frac12 &= \frac{m(m-1)\dots(m-n+1)}{m^n} + \binom{n}{2} \frac{m(m-1)(m-2)\dots(m-n+2)}{m^n} \\ &= \left(\frac{m-n+1}{m} + \binom{n}{2}\frac{1}{m} \right) \frac{m(m-1)(m-2)\dots(m-n+2)}{m^{n-1}} \\ &= \left(\frac{m-n+1}{m} + \binom{n}{2}\frac{1}{m} \right) \left(1 - \frac1m \right) \left(1 - \frac2m \right) \cdot \dots \cdot \left(1 - \frac{n-2}m \right) \\ &\approx \left(1 + \frac{n^2}{2m} \right) e^{-1/m} e^{-2/m} e^{-3/m} \dots e^{-(n-2)/m} \\ &\approx \left(1 + \frac{n^2}{2m} \right) e^{-n^2/2m} \\ \end{align}$$ which gives exactly the same approximation $\frac{n^2}{2m} \approx 1.678$ as in Ross Millikan's answer! Specifically, it gives $E[X] \approx 1.83\sqrt{m}$.
Another approximation is to use $E[X] = E[E[X|Y]]$ and assume that $Y$ (the time of the first collision) is concentrated about its mean. After the first collision happens (after $Y$ balls), there are $Y-1$ nonempty bins, so about a fraction $\frac{Y}{m}$ of the bins are nonempty and a fraction $1 - \frac{Y}{m}$ of the bins are empty. The second collision happens when either a ball lands into one of the currently occupied bins, or a ball lands for the second time into one of the bins not yet occupied. The former is much more likely to happen first than the latter. The expected time that the former takes is $\dfrac1{Y/m}$, as it's a geometric distribution (like tossing a coin that has probability $\frac{Y}{m}$ of heads, until heads comes up). So $$E[X] \approx E\left[Y + \frac{m}{Y}\right] $$ which if $Y$ is strongly concentrated around $\sqrt{\frac{\pi m}{2}}$, will be $$E[X] \approx \sqrt{\frac{\pi m}{2}} + \sqrt{\frac{2m}{\pi}} = \sqrt{m}\left( \sqrt{\frac{\pi}2} + \sqrt{\frac2\pi} \right)$$ which ($E[X] \approx 2.05\sqrt{m}$) is slightly more than the answer given by the first approximation.
We know that $E[X]$ is between $\sqrt{\frac{\pi}{2}}\sqrt{m}$ and $\sqrt{2\pi}\sqrt{m}$, so we know for sure that asymptotically, $E[X] \sim c\sqrt{m}$ for some constant $c$. We just need to find the constant $c$, and it seems the best way is to write a program for it. Here's a Python program that uses the exact formula above to calculate the value of $E[X]$ for increasing $m$, and prints $\frac{E[X]}{\sqrt{m}}$:
which prints (among other things):
so $c \approx 1.88$ or $E[X] \approx 1.88\sqrt{m}$ seems to be the true statement.
We can say more about birthdays specifically ($m = 365$). With a minor change to the above program, for each $n$, we can calculate exactly the probability that $X = n$ (i.e., that it takes $n$ people until the second collision happens). This turns out to give the following distribution.
So as $P(X > 43)$ is still a very healthy $0.288763$, it's nothing to be very surprised about that there is still only one collision among the 43 US presidents so far. :-)