Why is the expected number of draws given $k$ unique coupons so similar to the expected number of draws required to obtain $k$ unique coupons?

conditional-expectation, coupon-collector, expected-value, probability

Regarding the coupon collector’s problem, the question *Expected size of collection based on number of uniques* asks for the expected value of the number $N$ of draws made, given that $k$ out of $m$ unique coupons have been drawn. To make this well-defined, we need a prior for $N$. To answer the question, I assumed a uniform (improper) prior, found the corresponding posterior distribution for $N$ given $k$, and then computed its expected value using formal manipulations of a generating function. I was surprised to find that the result,

$$
\mathsf E[N\mid k]=m\sum_{r=1}^k\frac1{m-r}\;,
$$

is quite simple and very similar to the classical result for the number of draws required to obtain $k$ unique coupons,

$$
m\sum_{r=0}^{k-1}\frac1{m-r}\;,
$$

with the index merely shifted by $1$, so that the two expressions differ only by $m\left(\frac1{m-k}-\frac1m\right)=\frac k{m-k}$.
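For concreteness, here is a minimal Python sketch (the values of $m$ and $k$ and the helper names are purely illustrative) that evaluates both sums exactly with `fractions.Fraction` and confirms that they differ by exactly $\frac k{m-k}$:

```python
from fractions import Fraction

def posterior_mean(m, k):
    """E[N | k] under the uniform prior: m * sum_{r=1}^{k} 1/(m-r)."""
    return m * sum(Fraction(1, m - r) for r in range(1, k + 1))

def classical_mean(m, k):
    """Classical expected number of draws to collect k unique coupons: m * sum_{r=0}^{k-1} 1/(m-r)."""
    return m * sum(Fraction(1, m - r) for r in range(k))

m, k = 10, 4  # arbitrary example values
difference = posterior_mean(m, k) - classical_mean(m, k)
print(difference, Fraction(k, m - k), difference == Fraction(k, m - k))
```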

It seems there must be a better explanation for this than formal manipulations of a generating function – perhaps something along the lines of the classical argument with geometric distributions that leads to the classical result – but I can’t think of one. Can you?

The mystery deepens a bit if you try to heuristically derive this result from the classical one. If we denote by $T_k$ the number of draws required to draw $k$ unique coupons, then we have exactly $k$ unique coupons from draw $T_k$ until draw $T_{k+1}-1$. One might have expected $\mathsf E[N\mid k]$ to be something like the midpoint of the expected values of those two endpoints, but it turns out to be the expected value of the right endpoint: $\mathsf E[N\mid k]=\mathsf E[T_{k+1}-1]$ (still assuming a uniform prior for $N$).
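To make the claim $\mathsf E[N\mid k]=\mathsf E[T_{k+1}-1]$ concrete, here is a rough Monte Carlo sketch (the parameter values and trial count are arbitrary) that simulates $T_{k+1}-1$ and compares its average with the formula above:

```python
import random
from fractions import Fraction

def draws_until(m, j, rng):
    """Number of uniform draws from m coupon types needed to see j distinct ones."""
    seen, n = set(), 0
    while len(seen) < j:
        seen.add(rng.randrange(m))
        n += 1
    return n

m, k, trials = 10, 4, 200_000  # illustrative parameters
rng = random.Random(0)
mc_estimate = sum(draws_until(m, k + 1, rng) - 1 for _ in range(trials)) / trials
closed_form = float(m * sum(Fraction(1, m - r) for r in range(1, k + 1)))
print(mc_estimate, closed_form)  # should agree up to Monte Carlo error
```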

Best Answer

I figured this out now. The probability $p_n$ that $n$ is the last number of draws after which we have $k$ unique coupons is $\frac{m-k}m$ times the probability $q_n$ that we have $k$ unique coupons after $n$ draws, since $n$ is that last number iff the $(n+1)$-st draw yields one of the $m-k$ unseen coupons. The $p_n$ form a probability distribution over $n$, and the $q_n$ are a constant multiple of them, so normalizing the $q_n$ to obtain the posterior distribution of $N$ given $k$ yields exactly the $p_n$. Thus the expected number of draws given $k$ unique coupons (assuming a uniform prior for $N$) is the expected value of the last number of draws after which we have $k$ unique coupons, which is $1$ less than the well-known expected number of draws required to get $k+1$ unique coupons, that is,

$$ \mathsf E[N\mid k]=\left(m\sum_{r=0}^k\frac1{m-r}\right)-1=m\sum_{r=1}^k\frac1{m-r}\;. $$
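As a sanity check on this argument, here is a small Python sketch (with illustrative $m$, $k$, and truncation point) that computes $q_n$ by inclusion-exclusion, normalizes it to obtain the posterior for $N$, and compares the posterior mean with the closed form; it also checks that $\sum_n q_n=\frac m{m-k}$, which is exactly why normalizing the $q_n$ recovers the $p_n$:

```python
from math import comb

def q(n, m, k):
    """P(exactly k of m coupons seen after n draws), by inclusion-exclusion over the k chosen coupons."""
    onto = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return comb(m, k) * onto / m ** n

m, k = 10, 4            # illustrative values
n_max = 400             # truncation point; q_n decays like (k/m)^n, so the tail is negligible
ns = range(k, n_max)
weights = [q(n, m, k) for n in ns]
posterior_mean = sum(n * w for n, w in zip(ns, weights)) / sum(weights)
closed_form = m * sum(1 / (m - r) for r in range(1, k + 1))
print(posterior_mean, closed_form)   # the posterior mean matches the closed form
print(sum(weights), m / (m - k))     # and sum_n q_n = m/(m-k), so normalizing q_n gives p_n
```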