Expected number of die tosses until a repeat occurs

expected valueprobability

I have the following problem: calculate the expected number of tosses of fair die until two faces of the die appear (not necessarily in a consecutive manner.) I've found a solution in this post, but there is no proof of how the recurrence is obtained. I found the problem in a section of a book, concerning conditional expectation, so my guess is that the formula for the expectation of a discrete random variable
$$ \text{E}(X) = \sum_{y} \text{E}(X| Y = y) \text{P}(Y = y) $$
must be used in some manner to obtain the recurrence, or to obtain the expected value in some other manner, but I have not been able to do so. I need some help obtaining a formal proof.

Best Answer

The explanation is given (albeit briefly) in the answer.

If $E_r$ represents the expected number of additional throws needed to observe a duplicate outcome when $r$ out of $n$ numbers have not yet been observed, then for instance, $E_0$ represents the expected number of additional throws needed when all $n$ numbers have been seen before. Thus $E_0 = 1$ because if you've seen every number already, then the next roll is guaranteed to show a duplicate.

So a natural question to ask is, what is $E_1$? If there is only one number you have not yet seen, then with probability $1/n$, you will roll that number next, after which you will be in state $r = 0$, since you'll have rolled all $n$ numbers. Otherwise, with probability $(n-1)/n$, you will roll a number you have already seen, upon which you will have achieved the stopping condition. So the expected number of additional throws you will need when there is only one number not yet seen is $$E_1 = \frac{1}{n} (1 + E_0) + \frac{n-1}{n} \cdot 1 = 1 + \frac{1}{n} E_0.$$ Note we have a $1$ in each case because that represents the roll we had to take.

So we repeat this logic with a general $r$. There are two cases: either we roll a number we did not see before, or we roll a duplicate and stop. If $r$ numbers have not yet been seen, then the probability of the first case is $r/n$, and the probability of the second case is $1 - r/n = (n-r)/n$, because these two cases are the only ones possible.

So $$E_r = \frac{r}{n}(1 + E_{r-1}) + \frac{n-r}{n} \cdot 1 = 1 + \frac{r}{n} E_{r-1},$$ because in the first case, we still have not rolled a duplicate but now there is one less number we've not yet seen; and the second, we rolled a duplicate and stopped.