[Math] Coupon Collector’s problem to obtain at least half the coupons

balls-in-binscoupon-collectorprobability

I am looking at the Coupon Collector's problem. The classical result is that we need to collect $n \ln(n) + O(n)$ on average to have at least one coupon of each type.

I am searching for the number of coupons we need to collect to have at least $\frac{n}{2}$ different coupons, and I have a doubt on how to proceed.

Viewed as a balls and bins problem, this is equivalent to determining the number of balls to throw in order to have less than $\frac{n}{2}$ empty bins. If we throw $m$ balls into $n$ bins, it is know that the expected number of empty bins will be smaller or equal to $e^{-m/n}$. Hence by letting $m = \ln(2) \cdot n$ I have what I want.

However, if I try to solve it like in [1] or [2], I obtain that the number of balls is on average $nH_{n/2} \approx n \ln(n/2) + O(n)$ where $H_{n}$ is the $n$-th harmonic number.

Another approach is to count all the solutions that are acceptable ($n/2$, $n/2 + 1$, …, $n$ non-empty bins) yielding $n(H_n – H_{n/2}) \approx n\ln(2) + O(n)$ which is what I want.

Which approach is correct?

[1] Mitzenmacher, Michael, and Eli Upfal. Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge university press, 2005.
[2] Motwani, Rajeev, and Prabhakar Raghavan. Randomized algorithms. Chapman & Hall/CRC, 2010.

Best Answer

I believe the answer is indeed asymptotic to $n\ln2$.

A traditional way to solve the original problem is this: let $X_k$ be the random variable that describes, once we have $k$ distinct coupons, how many further coupons we need to obtain before we have $k+1$ distinct coupons. Since the probability that a new coupon is different from the first $k$ coupons is $1-\frac{k}n$, the distribution of $X_k$ is geometric with parameter $(1-\frac{k}n)^{-1} = \frac n{n-k}$, and therefore has expectation $\frac n{n-k}$. The total amount of time it takes to have all $n$ distinct coupons is $\sum_{k=0}^{n-1} X_k$; by linearity of expectation (we don't even use the fact that the $X_k$ are independent), the expected time is $\sum_{k=0}^{n-1} \frac n{n-k} = nH_n$.

But if we only want to measure the time to get half the coupons, the relevant sum is now $\sum_{k=0}^{n/2-1} \frac n{n-k} = nH_n - nH_{n/2} \sim n\ln2$.

Related Question