[Math] Coupon collector’s problem: mean and variance in number of coupons to be collected to complete a set (unequal probabilities)

coupon-collector, probability

There are $n$ coupons in a collection. A collector can purchase coupons but can't choose which one they get: each purchased coupon turns out to be coupon $i$ with probability $p_i=\frac 1 n$. Let $N$ be the number of coupons they need to collect until they have at least one coupon of each type. Find the expected value and variance of $N$. Bonus: generalize to the case where the probability of collecting the $j$th coupon is $p_j$, with $\sum\limits_{j=1}^n p_j=1$.
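
For a quick sanity check of the closed forms derived below, here is a minimal Monte Carlo sketch that estimates $E(N)$ and $V(N)$ by direct simulation (the probability vectors and trial count are arbitrary illustrative choices):

```python
import random

def draws_to_complete(p):
    """Draw coupons with probabilities p until every type has appeared;
    return the number of draws needed."""
    n, seen, draws = len(p), set(), 0
    while len(seen) < n:
        draws += 1
        seen.add(random.choices(range(n), weights=p, k=1)[0])
    return draws

def estimate_mean_var(p, trials=100_000):
    samples = [draws_to_complete(p) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((x - mean) ** 2 for x in samples) / (trials - 1)
    return mean, var

n = 5
print(estimate_mean_var([1 / n] * n))                  # equal case: E(N) ~ n*H_n ~ 11.42
print(estimate_mean_var([0.4, 0.3, 0.15, 0.1, 0.05]))  # an unequal example
```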

I recently came across this problem and came up with, or unearthed, various methods to solve it. I'm intending this page as a wiki of the various solutions, and I'll be posting all the ones I'm aware of (four so far) over time.


EDIT: As mentioned in the comments, this question is different from the one it has been marked a duplicate of: among other things, it asks for an expression for the variance, and it covers the general case where the coupons have unequal probabilities. The variance in the unequal-probability case has not been covered anywhere on the site apart from an earlier post of mine, which this one is intended to consolidate along with the other approaches to the problem.


EDIT: A paper on the solutions on this page has been submitted to arXiv: http://arxiv.org/abs/2003.04720

Best Answer

A3: Using the Poisson process to magically concoct independent random variables. This is the most powerful of all the approaches, since it's the only one that gives us both the mean and the variance (and higher moments as well) for the general case of coupons with unequal probabilities.

The other approaches either work for all moments but only for the special case of equal probabilities (A1 and A2), or for the general case of unequal probabilities but only for the mean (A4).

A question about this was asked and answered by me earlier: Coupon collectors problem: variance calculation missing a term. This post is a consolidation.


In Example 5.17 of Introduction to Probability Models by Sheldon Ross, the coupon collector's problem is tackled for the general case where the probability of drawing coupon $j$ is $p_j$ and, of course, $\sum\limits_j p_j = 1$.

Now, he imagines that the collector collects the coupons according to a Poisson process with rate $\lambda=1$. Furthermore, every coupon that arrives is of type $j$ with probability $p_j$, independently of everything else.

Now, he defines $X_j$ as the first time a coupon of type $j$ is observed. By the thinning property, coupons of type $j$ arrive according to a Poisson process with rate $p_j$, so $X_j \sim \text{Exp}(p_j)$. We're interested in the time it takes to collect all coupons, $X$ (for now; eventually, we're interested in the number of coupons to be collected, $N$). So we get:

$$X = \max_{1\leq j \leq n}X_j$$

Note that if we denote by $N_j$ the number of coupons that need to be collected until the first coupon of type $j$ is seen, we also have, for the number needed to collect all coupons, $N$:

$$N = \max_{1\leq j \leq n}N_j \tag{0}$$

This equation is less useful since the $N_j$ are not independent. It can still be used to get the mean (see answer A4), but getting the variance this way is considerably more challenging because of the dependence between the underlying random variables.

But the incredible fact that the $X_j$ are independent (discussion on that here) allows us to write:

$$F_X(t) = P(X\leq t) = P(X_j\leq t \; \forall \; j) = \prod\limits_{j=1}^{n}(1-e^{-p_j t})\tag{1}$$

Mean

Now, Ross uses the expression $E(X) = \int\limits_0^\infty S_X(t)\,dt$, where $S_X(t)=1-F_X(t)$ is the survival function, to get:

$$E(X) = \int\limits_{0}^{\infty}\left(1-\prod\limits_{j=1}^{n}(1-e^{-p_j t})\right) dt $$

$$= \sum\limits_j\frac 1 {p_j} - \sum\limits_{i<j}\frac {1}{p_i+p_j} + \dots +(-1)^{n-1} \frac{1}{p_1+\dots+p_n}\tag{2}$$
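
A quick numerical sketch of equation (2): the alternating sum over non-empty subsets should match direct numerical integration of the survival function $1-\prod_j(1-e^{-p_j t})$ (the example probabilities and integration grid are arbitrary choices):

```python
import numpy as np
from itertools import combinations

def mean_inclusion_exclusion(p):
    """E(X) from equation (2): alternating sum over non-empty subsets of coupons."""
    return sum((-1) ** (k - 1) / sum(s)
               for k in range(1, len(p) + 1)
               for s in combinations(p, k))

def mean_by_integration(p, t_max=400.0, steps=400_000):
    """E(X) as the integral of the survival function 1 - prod_j (1 - e^(-p_j t))."""
    t = np.linspace(0.0, t_max, steps)
    survival = 1.0 - np.prod(1.0 - np.exp(-np.outer(t, p)), axis=1)
    return float(np.sum(survival) * (t[1] - t[0]))

p = [0.4, 0.3, 0.15, 0.1, 0.05]      # illustrative probabilities
print(mean_inclusion_exclusion(p))   # the two numbers should agree closely
print(mean_by_integration(p))
```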

For our case here we have: $p_j=\frac{1}{n} \forall j$

Substituting in the equation above we get:

$$E(X) = n\sum\limits_{k=1}^{n}(-1)^{k-1} \frac{n \choose k}{k}$$

From here, using the binomial identity $\sum\limits_{k=1}^n(-1)^{k-1}\frac{n\choose k}{k}=\sum\limits_{k=1}^n\frac 1 k$ (the $r=1$ case of the identity noted in the Variance section below), we get as before:

$$E(X) = n\sum\limits_{k=1}^n \frac{1}{k}$$
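
A minimal exact check of this collapse, using rational arithmetic (the range of $n$ is arbitrary):

```python
from fractions import Fraction
from math import comb

def alternating_form(n):
    # n * sum_{k=1}^{n} (-1)^(k-1) * C(n, k) / k
    return n * sum(Fraction((-1) ** (k - 1) * comb(n, k), k) for k in range(1, n + 1))

def harmonic_form(n):
    # n * H_n
    return n * sum(Fraction(1, k) for k in range(1, n + 1))

assert all(alternating_form(n) == harmonic_form(n) for n in range(1, 12))
print("alternating sum == n * H_n for n = 1..11")
```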

Further, Ross shows that $E(N)=E(X)$ using the law of total expectation.

First, he notes,

$$E(X\mid N=k)=k\,E(T_i)$$

where the $T_i$ are the i.i.d. inter-arrival times of the coupon arrivals. Since these are exponential with rate $1$, $E(T_i)=1$ and so

$$E(X\mid N)=N\tag{3}$$

Taking expectations on both sides and using the law of total expectation we get:

$$E(X)=E(N)$$
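
A small sketch of the Poissonization step itself: drawing coupons at the arrival times of a rate-1 Poisson process, the empirical means of $X$ and $N$ should agree (the probability vector and trial count are arbitrary):

```python
import random

def poissonized_collection(p):
    """Collect coupons at the arrival times of a rate-1 Poisson process.
    Returns (X, N): the time and the number of draws needed to see every type."""
    n, seen, t, draws = len(p), set(), 0.0, 0
    while len(seen) < n:
        t += random.expovariate(1.0)                  # rate-1 inter-arrival time
        draws += 1
        seen.add(random.choices(range(n), weights=p, k=1)[0])
    return t, draws

p = [0.4, 0.3, 0.15, 0.1, 0.05]                       # illustrative probabilities
trials = 100_000
results = [poissonized_collection(p) for _ in range(trials)]
print(sum(x for x, _ in results) / trials)            # estimate of E(X)
print(sum(m for _, m in results) / trials)            # estimate of E(N), should be close
```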

Variance

This approach can easily be extended to find $V(N)$, the variance (not covered by Ross). We can use the following expression to get $E(X^2)$ (it follows from writing $X^2=\int_0^X 2t\,dt$ and taking expectations):

$$E(X^2) = \int\limits_0^\infty 2tP(X>t)dt = \int\limits_0^\infty 2t\left(1-\prod\limits_{j=1}^n(1-e^{-p_j t})\right)dt$$

Using the fact that $\int\limits_0^\infty te^{-pt}\,dt=\frac{1}{p^2}$ and the same algebra as for $E(X)$ we get:

$$\frac{E(X^2)}{2} = \sum\limits_j \frac {1} {p_j^2} -\sum\limits_{i<j} \frac{1}{(p_i+p_j)^2}+\dots +(-1)^{n-1}\frac{1}{(p_1+\dots+p_n)^2} $$
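
As with the mean, this can be sanity-checked numerically: the subset sum should match direct integration of $2t\,P(X>t)$ (example probabilities and grid are again arbitrary):

```python
import numpy as np
from itertools import combinations

def second_moment_inclusion_exclusion(p):
    """E(X^2) = 2 * alternating sum over non-empty subsets of 1 / (sum of p's)^2."""
    return 2.0 * sum((-1) ** (k - 1) / sum(s) ** 2
                     for k in range(1, len(p) + 1)
                     for s in combinations(p, k))

def second_moment_by_integration(p, t_max=600.0, steps=600_000):
    """E(X^2) as the integral of 2 t P(X > t)."""
    t = np.linspace(0.0, t_max, steps)
    tail = 1.0 - np.prod(1.0 - np.exp(-np.outer(t, p)), axis=1)
    return float(np.sum(2.0 * t * tail) * (t[1] - t[0]))

p = [0.4, 0.3, 0.15, 0.1, 0.05]              # illustrative probabilities
print(second_moment_inclusion_exclusion(p))  # the two numbers should agree closely
print(second_moment_by_integration(p))
```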

Now, let's consider the special case where all coupons have an equal probability of being selected. In other words, $p_j=\frac 1 n \; \forall \; j$.

We get:

$$\frac{E(X^2)}{2} = n^2\left(\sum\limits_{k=1}^n (-1)^{k-1}\frac{n\choose k}{k^2}\right)$$

Per my answer to the question here, this summation yields:

$$E(X^2) = 2n^2\left( \sum_{j=1}^n\sum_{k=1}^j\frac{1}{jk}\right)\tag{4}$$
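
A minimal exact check that the two forms of $E(X^2)$ agree (the range of $n$ is arbitrary):

```python
from fractions import Fraction
from math import comb

def alternating_form(n):
    # 2 n^2 * sum_{k=1}^{n} (-1)^(k-1) * C(n, k) / k^2
    return 2 * n * n * sum(Fraction((-1) ** (k - 1) * comb(n, k), k * k)
                           for k in range(1, n + 1))

def double_harmonic_form(n):
    # 2 n^2 * sum_{j=1}^{n} sum_{k=1}^{j} 1 / (j k), i.e. equation (4)
    return 2 * n * n * sum(Fraction(1, j * k)
                           for j in range(1, n + 1) for k in range(1, j + 1))

assert all(alternating_form(n) == double_harmonic_form(n) for n in range(1, 12))
print("equation (4) matches the alternating sum for n = 1..11")
```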

As a side-note, the binomial identity arising from the calculation of the second moment can be generalized: $$\sum_{k=1}^n(-1)^{k-1}\frac{n\choose k}{k^r}=\sum_{1\leq i_1\leq i_2\leq\dots \leq i_r\leq n}\frac{1}{i_1 i_2 \dots i_r}$$ See here.
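
The generalized identity can also be verified exactly for small $n$ and $r$; note that the indices run over non-decreasing tuples $1\leq i_1\leq\dots\leq i_r\leq n$ (the ranges checked below are arbitrary):

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb, prod

def lhs(n, r):
    return sum(Fraction((-1) ** (k - 1) * comb(n, k), k ** r) for k in range(1, n + 1))

def rhs(n, r):
    # sum over 1 <= i_1 <= ... <= i_r <= n of 1 / (i_1 * ... * i_r)
    return sum(Fraction(1, prod(idx))
               for idx in combinations_with_replacement(range(1, n + 1), r))

assert all(lhs(n, r) == rhs(n, r) for n in range(1, 8) for r in range(1, 4))
print("identity verified for n = 1..7, r = 1..3")
```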

Equation (4) has given us $E(X^2)$, but remember that we're ultimately interested in $V(N)$.

Using the law of total variance we get:

$$V(X)=E(V(X|N))+V(E(X|N))$$

So per equation (3) we have:

$$V(X)=E(V(X|N))+V(N)\tag{5}$$

Now,

$$V(X|N)=NV(T_i)$$

And since $T_i \sim Exp(1)$, we have $V(T_i)=1$ meaning, $V(X|N)=N$.

Substituting into (5),

$$V(X)=E(N)+V(N)$$

Since $E(X)=E(N)$, we have $V(X)=E(X^2)-E(X)^2=E(X^2)-E(N)^2$, and therefore

$$V(N)=E(X^2)-E(N)-E(N)^2$$

where equation (4) gives $E(X^2)$ and $E(N)=n\sum_{k=1}^n \frac{1}{k}$, as shown multiple times on this page. This is consistent with equation (5) of A2.
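
Putting everything together, a final sketch: compute $E(N)$ and $E(X^2)$ from the general subset sums, form $V(N)=E(X^2)-E(N)-E(N)^2$, and compare against a direct simulation (the unequal probability vector and trial count are illustrative):

```python
import random
from itertools import combinations

def subset_sum(p, power):
    """Alternating inclusion-exclusion sum of 1 / (subset probability)^power."""
    return sum((-1) ** (k - 1) / sum(s) ** power
               for k in range(1, len(p) + 1)
               for s in combinations(p, k))

def coupon_mean_var(p):
    e_n = subset_sum(p, 1)              # E(N) = E(X), equation (2)
    e_x2 = 2.0 * subset_sum(p, 2)       # E(X^2)
    return e_n, e_x2 - e_n - e_n ** 2   # V(N) = E(X^2) - E(N) - E(N)^2

def simulated_mean_var(p, trials=100_000):
    n, samples = len(p), []
    for _ in range(trials):
        seen, draws = set(), 0
        while len(seen) < n:
            draws += 1
            seen.add(random.choices(range(n), weights=p, k=1)[0])
        samples.append(draws)
    mean = sum(samples) / trials
    var = sum((x - mean) ** 2 for x in samples) / (trials - 1)
    return mean, var

p = [0.4, 0.3, 0.15, 0.1, 0.05]         # illustrative unequal probabilities
print(coupon_mean_var(p))
print(simulated_mean_var(p))            # should agree up to Monte Carlo error
```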