Coupon Collector’s Problem – Most Collected Coupon Analysis

Tags: combinatorics, coupon-collector, expected-value, probability

Suppose a coupon collector is collecting a set of $n$ coupons that he receives one by one, uniformly at random.

If the collector stops exactly when the collection is complete, we know the expected number of coupons in his collection is $n H_n$, where $H_n$ is the $n$-th harmonic number. What is the expected number of copies, $M$, of his most collected coupon?

When $n = 2$, we have $M = 2$: after the first coupon, each subsequent draw yields the missing coupon with probability $1/2$, so the number of copies of the first coupon at completion is $\operatorname{Geom}(1/2)$-distributed with mean $2$. On average he therefore collects one additional copy of the first coupon, and that coupon is certain to be his most collected one.
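Spelled out: if $G$ denotes the copy count of the first coupon when the collection completes, then $\mathbf{P}(G = k) = (1/2)^k$ for $k \geq 1$ (the second coupon type always ends with exactly one copy), so

$$M = \mathbf{E}[G] = \sum_{k=1}^{\infty} k \left( \frac{1}{2} \right)^k = 2.$$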

I don't know the exact value for any $n$ larger than $2$, but I found some approximate values by simulation:

    n        M
    3        2.8415
    4        3.4992
    5        4.0259
    6        4.4633
    7        4.8377
    8        5.1649
    9        5.4560
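For reference, a minimal Mathematica sketch of such a simulation (the helper name `simulateM` is my own) could look like this:

    (* Draw coupons uniformly at random until every type appears at least
       once, record the count of the most collected type, and average. *)
    simulateM[n_, trials_] := N@Mean@Table[
        Module[{counts = ConstantArray[0, n]},
         While[Min[counts] == 0, counts[[RandomInteger[{1, n}]]]++];
         Max[counts]],
        {trials}]

    simulateM[3, 100000]  (* should come out near 2.8415 *)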

EDIT 1:

For small $n$, directly enumerating the possible collections converges faster than simulation, and from this approach I hypothesize that $M[3] = (15 + 6 \sqrt{5})/10$, but I don't have any real argument to support that claim.

EDIT 2:

With some more thought, I find that $M$ has an explicit sum formula. The probability that the collection terminates with counts $v = (c_1, \ldots, c_{n-1})$ for the coupons other than the final one collected is the multinomial coefficient $\binom{c_1 + \cdots + c_{n-1}}{c_1, \ldots, c_{n-1}}$ divided by $n^{c_1 + \cdots + c_{n-1}}$. So $M$ is an infinite sum over $n - 1$ variables. Here is Mathematica code that computes the partial sum over the terms where the collection has no more than $k$ copies of any coupon:

M[n_, k_] := Sum[Max[v]*(Multinomial @@ v)/n^Total[v], {v, Tuples[Range[1, k], n-1]}]

Mathematica can't actually evaluate this in closed form as $k \rightarrow \infty$, though.
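The truncation converges quickly, since the probability that any coupon reaches many copies decays geometrically, so a moderate cutoff should already agree with the conjectured value of $M[3]$ to several decimal places:

    N[M[3, 30]]              (* should be ≈ 2.84164 *)
    N[(15 + 6 Sqrt[5])/10]   (* ≈ 2.84164 *)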

Best Answer

Value of $M_3$. I will denote $M$ by $M_n$ to emphasize the dependence on $n$. Then, using OEIS A230137, we get

\begin{align*} M_3 &= \sum_{n=2}^{\infty} \frac{\sum_{k=1}^{n-1} \binom{n}{k} \max\{k,n-k\}}{3^n} \\ &= \sum_{n=1}^{\infty} \frac{\left( 4^n + \binom{2n}{n} \right)n - 2(2n)}{3^{2n}} + \sum_{n=1}^{\infty} \frac{\left( 4^n + \binom{2n}{n} \right)(2n+1) - 2(2n+1)}{3^{2n+1}} \\ &= 3\left(\frac{1}{2} + \frac{1}{\sqrt{5}}\right) \\ &\approx 2.84164, \end{align*}

which is the same as what OP conjectured.
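As a sanity check (my addition, not part of the derivation), Mathematica should be able to confirm that the two series above sum to this closed form:

    s1 = Sum[((4^n + Binomial[2 n, n]) n - 2 (2 n))/3^(2 n), {n, 1, Infinity}];
    s2 = Sum[((4^n + Binomial[2 n, n]) (2 n + 1) - 2 (2 n + 1))/3^(2 n + 1), {n, 1, Infinity}];
    FullSimplify[s1 + s2 - 3 (1/2 + 1/Sqrt[5])]  (* should reduce to 0 *)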


Asymptotic Formula of $M_n$. A heuristic argument suggests that

$$M_n \sim e \log n \qquad \text{as} \quad n \to \infty.$$

The figure below compares simulated values of $M_n$ and the values of the asymptotic formula.

*(figure: simulated values of $M_n$ compared with $e \log n$)*

Heuristic argument. First, we consider the Poissonized version of the problem:

1. Let $N^{(1)}, \ldots, N^{(n)}$ be independent Poisson processes with rate $\frac{1}{n}$. Then the arrivals in each $N^{(i)}$ model the times when the collector receives coupons of type $i$.

2. Let $T$ be the first time a complete set of coupons is collected. Then we know that $T$ has the same distribution as the sum of $n$ independent exponential RVs with respective rates $1$, $\frac{n-1}{n}$, $\ldots$, $\frac{1}{n}$. (This is an example of the exponential race problem.) For large $n$, this is asymptotically normal with mean $\sum_{i=1}^{n}\frac{n}{i} \sim n \log n$ and variance $\sum_{i=1}^{n}\frac{n^2}{i^2} \sim \zeta(2)n^2$. This tells us that $T \sim n \log n$ in probability, and so, it is not unreasonable to expect that

$$ M_n = \mathbf{E}\Bigl[ \max_{1\leq i\leq n} N^{(i)}_T \Bigr] \sim \mathbf{E}\Bigl[ \max_{1\leq i\leq n} N^{(i)}_{n\log n} \Bigr] \tag{1} $$

holds as well.

3. Note that each $N^{(i)}_{n\log n}$ has a Poisson distribution with mean $\log n$. Now, we fix $\theta > 1$ and choose an integer sequence $(a_n)$ such that $a_n > \log n$ and $a_n/\log n \to \theta$ as $n \to \infty$. Then

\begin{align*} \mathbf{P}\Bigl( N^{(i)}_{n\log n} > a_n \Bigr) &\asymp \frac{(\log n)^{a_n}}{a_n!}e^{-\log n} \\ &\asymp \frac{(\log n)^{a_n}}{(a_n)^{a_n} e^{-a_n+\log n} \sqrt{a_n}} \\ &\asymp \frac{1}{n^{\psi(\theta) + 1 + o(1)}}, \qquad \psi(\theta) = \theta \log \theta - \theta, \end{align*}

where $f(n) \asymp g(n)$ means that $f(n)/g(n)$ is bounded away from both $0$ and $\infty$ as $n\to\infty$. This shows that

$$ \mathbf{P}\Bigl( \max_{1\leq i\leq n} N^{(i)}_{n\log n} \leq a_n \Bigr) = \biggl[ 1 - \frac{1}{n^{\psi(\theta) + 1 + o(1)}} \biggr]^n \xrightarrow[n\to\infty]{} \begin{cases} 0, & \psi(\theta) < 0, \\[0.5em] 1, & \psi(\theta) > 0. \end{cases} $$

So, by noting that $\psi(e) = 0$, we get

$$ \max_{1\leq i\leq n} N^{(i)}_{n\log n} \sim e \log n \qquad \text{in probability}. \tag{2} $$

This suggests that $M_n \sim e \log n$ as well. (A quick numerical sanity check of this appears at the end of this answer.)

4. I believe that this heuristic argument can be turned into an actual proof. To this end, we need the following:

  1. We have to justify the relation $\text{(1)}$ in an appropriate sense. We may perhaps study the inequality

    $$ \max_{1\leq i\leq n} N^{(i)}_{(1-\varepsilon)n\log n} \leq \max_{1\leq i\leq n} N^{(i)}_T \leq \max_{1\leq i\leq n} N^{(i)}_{(1+\varepsilon)n\log n} \qquad \text{w.h.p.} $$

    and show that the probability of the exceptional event is small enough to bound the difference of both sides of $\text{(1)}$ by a vanishing quantity.

  2. Note that $\text{(2)}$ alone is not enough to show this. So, we need to elaborate the argument in step 3 so that it can actually prove the relation $\mathbf{E}\bigl[ \max_{1\leq i\leq n} N^{(i)}_{n\log n} \bigr] \sim e \log n$.
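As the rough numerical check promised above (my own addition, using only the Poissonized quantities from step 3), one can compare the sample mean of $\max_{1\leq i\leq n} N^{(i)}_{n\log n}$, i.e. the maximum of $n$ i.i.d. $\text{Poisson}(\log n)$ variables, with $e \log n$:

    (* Sample mean of the max of n i.i.d. Poisson(Log[n]) variables,
       to be compared with the conjectured asymptotic E*Log[n]. *)
    maxPoisson[n_, trials_] := N@Mean@Table[
        Max@RandomVariate[PoissonDistribution[Log[n]], n], {trials}]

    Table[{n, maxPoisson[n, 1000], N[E Log[n]]}, {n, {10^2, 10^3, 10^4}}]

The agreement is only rough at these sizes, since the $o(1)$ in the exponent of the tail estimate decays slowly with $n$.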