$\def\m{\boldsymbol\mu}\def\p{\mathbf p}$
The expected value in question reads:
$$
\mathbb E=\sum_i\frac1{p_i}-\sum_{(i,j)}\frac1{p_i+p_j}+\sum_{(i,j,k)}\frac1{p_i+p_j+p_k}-\cdots+\frac{(-1)^{n-1}}{\underbrace{p_1+p_2+\cdots+p_n}_{=1}}.\tag1
$$
It can be written in a compact notation as:
$$
\mathbb E=\sum_{\m\ne0}\frac{(-1)^{|\m|-1}}{\m\cdot\p},\tag2
$$
where $\m$ ranges over the $2^n$ binary vectors of length $n$, with $|\m|=\sum_{i=1}^n\mu_i$, and $\p=(p_1,p_2,\dots, p_n)$. The summation in (2) runs over all $\m$ except $(0,0,\dots,0)$.
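For small $n$, formula (2) can be evaluated directly by iterating over the $2^n$ bitmasks. Here is a minimal C sketch of that computation (the probabilities in p[] are illustrative placeholders):

#include <stdio.h>

/* Evaluate formula (2): sum over all nonzero bitmasks mu of
   (-1)^(|mu|-1) / (mu . p). Feasible for small n only. */
int main(void)
{
    double p[] = {0.5, 0.3, 0.2};   /* illustrative probabilities summing to 1 */
    int n = 3;
    double E = 0.0;
    for(int mu = 1; mu < (1 << n); mu++){
        double dot = 0.0; int wt = 0;   /* mu . p and |mu| */
        for(int i = 0; i < n; i++)
            if(mu & (1 << i)){ dot += p[i]; wt++; }
        E += ((wt % 2) ? 1.0 : -1.0) / dot;
    }
    printf("E = %.6f\n", E);   /* expected draws to see every type at least once */
    return 0;
}

For the placeholder vector this prints $E \approx 6.654762$, matching $\frac{1}{0.5}+\frac{1}{0.3}+\frac{1}{0.2}-\frac{1}{0.8}-\frac{1}{0.7}-\frac{1}{0.5}+1$.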
What follows is a computational contribution where we derive a closed
form (as opposed to an infinite series) for the expected number of
draws required to see all coupons at least twice when a number $n'$ of
the $n$ coupon types, with $n' < n$, have already been collected
in two instances. We then observe that the expectation does not
simplify. It seems like a rewarding challenge to compute the
asymptotics of these expectations using probabilistic methods and
compare them to the closed form presented below.
Using the notation from this MSE
link we have from
first principles that
$$P[T = m] = \frac{1}{n^m}\times {n-n'\choose 1}\times
(m-1)! [z^{m-1}] \exp(n'z)
\left(\exp(z) - 1 - z\right)^{n-n'-1}
\frac{z^1}{1!}.$$
Here the factor $\frac{1}{n^m}$ is the probability of any fixed sequence of $m$ draws, ${n-n'\choose 1}$ chooses the type completed on draw $m$, and $(m-1)!\,[z^{m-1}]$ counts the arrangements of the first $m-1$ draws: $\exp(n'z)$ covers the $n'$ types that are already complete, $\left(\exp(z)-1-z\right)^{n-n'-1}$ requires each remaining incomplete type to appear at least twice, and $\frac{z^1}{1!}$ marks the single earlier occurrence of the type completed last.
We verify that this is a probability distribution. We get
$$\sum_{m\ge 2} P[T=m]
\\ = (n-n') \sum_{m\ge 2} \frac{1}{n^m}
(m-1)! [z^{m-2}] \exp(n'z)
\left(\exp(z) - 1 - z\right)^{n-n'-1}
\\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m}
(m+1)! [z^{m}] \exp(n'z)
\left(\exp(z) - 1 - z\right)^{n-n'-1}
\\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m}
(m+1)! [z^{m}] \exp(n'z)
\\ \times \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} \exp((n-n'-1-p)z)
(-1)^{p} (1+z)^p
\\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m}
(m+1)!
\\ \times [z^{m}]
\sum_{p=0}^{n-n'-1} {n-n'-1\choose p} \exp((n-1-p)z)
(-1)^{p} (1+z)^p
\\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m}
(m+1)!
\\ \times
\sum_{p=0}^{n-n'-1} {n-n'-1\choose p}
\sum_{q=0}^{m}
[z^{m-q}] \exp((n-1-p)z)
(-1)^{p} [z^q] (1+z)^p
\\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m}
(m+1)!
\\ \times
\sum_{p=0}^{n-n'-1} {n-n'-1\choose p}
\sum_{q=0}^{m}
\frac{(n-1-p)^{m-q}}{(m-q)!}
(-1)^{p} {p\choose q}.$$
Re-arranging the order of the sums now yields
$$(n-n') \frac{1}{n^2} \sum_{p=0}^{n-n'-1} {n-n'-1\choose p}
\\ \times \sum_{m\ge 0} \frac{1}{n^m} (m+1)!
\sum_{q=0}^{m}
\frac{(n-1-p)^{m-q}}{(m-q)!}
(-1)^{p} {p\choose q}
\\ = (n-n') \frac{1}{n^2} \sum_{p=0}^{n-n'-1} {n-n'-1\choose p}
\\ \times
\sum_{q\ge 0} (-1)^{p} {p\choose q}
\sum_{m\ge q} \frac{1}{n^m} (m+1)!
\frac{(n-1-p)^{m-q}}{(m-q)!}.$$
Simplifying the inner sum we get
$$\frac{1}{n^q} \sum_{m\ge 0} \frac{1}{n^m} (m+q+1)!
\frac{(n-1-p)^{m}}{m!}
\\ = \frac{(q+1)!}{n^q} \sum_{m\ge 0} \frac{1}{n^m}
{m+q+1\choose q+1} (n-1-p)^m
\\ = \frac{(q+1)!}{n^q} \frac{1}{(1-(n-1-p)/n)^{q+2}}
= (q+1)! n^2 \frac{1}{(p+1)^{q+2}}.$$
We thus obtain for the sum of the probabilities
$$\sum_{m\ge 2} P[T=m] =
(n-n') \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} (-1)^{p}
\sum_{q=0}^p {p\choose q}
(q+1)! \frac{1}{(p+1)^{q+2}}.$$
Repeating the computation with an extra factor of $m$ (so that $m\times(m-1)! = m!$, which shifts $(q+1)!$ to $(q+2)!$ and the exponent $q+2$ to $q+3$ in the inner sum) we obtain for the expectation
$$\bbox[5px,border:2px solid #00A000]{
E[T] = n (n-n') \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} (-1)^{p}
\sum_{q=0}^p {p\choose q}
\frac{(q+2)!}{(p+1)^{q+3}}.}$$
Now to simplify these we start with the inner sum from the probability,
using the fact that
$$\sum_{q=0}^p {p\choose q} (q+1)! \frac{1}{(p+1)^{q+1}} = 1$$
which was proved by residues at the cited link from the introduction.
We then obtain
$$(n-n') \sum_{p=0}^{n-n'-1} {n-n'-1\choose p}
\frac{(-1)^{p}}{p+1}
\\ = \sum_{p=0}^{n-n'-1} {n-n'\choose p+1} (-1)^p
= - \sum_{p=1}^{n-n'} {n-n'\choose p} (-1)^p
\\ = 1 - \sum_{p=0}^{n-n'} {n-n'\choose p} (-1)^p
= 1 - (1-1)^{n-n'} = 1$$
which confirms that this is indeed a probability distribution. We will
not attempt the analogous manipulation with the expectation, since
actual computation of the values indicates that it does not simplify,
as announced earlier. For example, these are the expectations for the
pairs $(2n', n')$:
$$4,11,{\frac {347}{18}},{\frac {12259}{432}},
{\frac {41129339}{1080000}},{\frac {390968681}{8100000}},
{\frac {336486120012803}{5717741400000}}, \ldots$$
and for pairs $(3n', n'):$
$${\frac {33}{4}},{\frac {12259}{576}},{\frac {390968681}{10800000}},
{\frac {2859481756726972261}{54646360473600000}}, \ldots$$
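For a quick double-precision check of the boxed closed form against the exact values above, the following sketch may be used (exact rational arithmetic would require a bignum library; for the small parameters shown, doubles suffice):

#include <stdio.h>

/* Binomial coefficient in double precision (adequate for small n). */
double binom(int n, int k)
{
    double b = 1.0;
    for(int i = 0; i < k; i++) b = b * (n - i) / (i + 1);
    return b;
}

/* Closed form: E[T] = n (n-n') sum_p C(n-n'-1,p) (-1)^p
                         sum_{q=0}^p C(p,q) (q+2)!/(p+1)^{q+3}. */
double closed_form(int n, int np)
{
    double outer = 0.0;
    for(int p = 0; p <= n - np - 1; p++){
        double inner = 0.0;
        double fact = 2.0;                       /* (q+2)!      */
        double pw = (double)(p+1)*(p+1)*(p+1);   /* (p+1)^{q+3} */
        for(int q = 0; q <= p; q++){
            inner += binom(p, q) * fact / pw;
            fact *= q + 3;
            pw *= p + 1;
        }
        outer += ((p % 2) ? -1.0 : 1.0) * binom(n - np - 1, p) * inner;
    }
    return n * (n - np) * outer;
}

int main(void)
{
    printf("%.6f\n", closed_form(2, 1));   /* 4                  */
    printf("%.6f\n", closed_form(4, 2));   /* 11                 */
    printf("%.6f\n", closed_form(6, 3));   /* 347/18 = 19.277778 */
    return 0;
}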
The reader who seeks numerical evidence confirming the closed form, or
additional clarification of the problem definition used, may
consult the following simple C program, whose output matched the
formula in all cases that were examined.
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <time.h>

int main(int argc, char **argv)
{
    /* Defaults; command line: n (coupon types), np (types already
       complete), j (required instances per type), trials. */
    int n = 6, np = 3, j = 3, trials = 1000;

    if(argc >= 2){
        n = atoi(argv[1]);
    }
    if(argc >= 3){
        np = atoi(argv[2]);
    }
    if(argc >= 4){
        j = atoi(argv[3]);
    }
    if(argc >= 5){
        trials = atoi(argv[4]);
    }

    assert(1 <= n);
    assert(1 <= np && np < n);
    assert(1 <= j);
    assert(1 <= trials);

    srand48(time(NULL));

    long long data = 0;
    for(int tind = 0; tind < trials; tind++){
        /* np types start fully collected (j instances), the rest at zero. */
        int seen = np; int steps = 0;
        int dist[n];
        for(int cind = 0; cind < n; cind++){
            if(cind < np)
                dist[cind] = j;
            else
                dist[cind] = 0;
        }
        /* Draw uniformly until every type has been seen j times. */
        while(seen < n){
            int coupon = drand48() * (double)n;   /* uniform in {0,...,n-1} */
            steps++;
            if(dist[coupon] == j-1)
                seen++;
            dist[coupon]++;
        }
        data += steps;
    }

    long double expt = (long double)data/(long double)trials;
    printf("[n = %d, np = %d, j = %d, trials = %d]: %Le\n",
           n, np, j, trials, expt);
    exit(0);
}
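Assuming a POSIX environment (needed for srand48/drand48), the program can be compiled and run along the lines of

gcc -O2 -o coupon coupon.c
./coupon 6 3 2 1000000

(the file name coupon.c is a placeholder). With $j=2$ this invocation corresponds to the pair $(n, n') = (6, 3)$, whose closed-form value is ${\frac{347}{18}} \approx 19.2778$.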
Best Answer
A3: Using the Poisson process to magically concoct independent random variables. This is the most powerful of all the approaches, since it is the only one that allows us to solve for both the mean and the variance of the coupon collector's problem in the general case of coupons having unequal probabilities (and higher moments as well).
The other approaches either work for all moments but only for the special case of equal probabilities (A1 and A2), or for the general case of unequal probabilities but only for the mean (A4).
A question about this was asked and answered by me earlier: Coupon collectors problem: variance calculation missing a term. This post is a consolidation.
In Example 5.17 of the book Introduction to Probability Models by Sheldon Ross, the coupon collector's problem is tackled for the general case where the probability of drawing coupon $j$ is given by $p_j$ and, of course, $\sum\limits_j p_j = 1$.
Now, he imagines that the collector collects the coupons in accordance with a Poisson process with rate $\lambda=1$. Furthermore, every coupon that arrives is of type $j$ with probability $p_j$.
Now, he defines $X_j$ as the first time a coupon of type $j$ is observed; by the thinning property, coupons of type $j$ arrive in accordance with a Poisson process with rate $p_j$. We're interested in the time it takes to collect all coupons, $X$ (for now; eventually, we're interested in the number of coupons to be collected, $N$). So we get:
$$X = \max_{1\leq j \leq n}X_j$$
Note that if we denote by $N_j$ the number of coupons to be collected before the first coupon of type $j$ is seen, we also have for the number needed to collect all coupons, $N$:
$$N = \max_{1\leq j \leq n}N_j \tag{0}$$
This equation is less useful since the $N_j$ are not independent. It can still be used to get the mean (see answer A4), but trying to get the variance with this approach gets considerably more challenging due to this dependence of the underlying random variables.
But the incredible fact that the $X_j$ are independent (discussion on that here) allows us to get:
$$F_X(t) = P(X<t) = P(X_j<t \; \forall \; j) = \prod\limits_{j=1}^{n}(1-e^{-p_j t})\tag{1}$$
Now, Ross uses the expression $E(X) = \int\limits_0^\infty S_X(t)dt$, where $S_X(t)$ is the survival function, to get:
$$E(X) = \int\limits_{0}^{\infty}\left(1-\prod\limits_{j=1}^{n}(1-e^{-p_j t})\right) dt $$
$$= \sum\limits_j\frac{1}{p_j} - \sum\limits_{i<j}\frac {1}{p_i+p_j} + \dots +(-1)^{n-1} \frac{1}{p_1+\dots+p_n}\tag{2}$$
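The step from the integral to (2) is term-by-term integration of the expanded product; written out (a routine expansion, supplied here for completeness):
$$1-\prod\limits_{j=1}^{n}\left(1-e^{-p_j t}\right)=\sum_{\varnothing\ne S\subseteq\{1,\dots,n\}}(-1)^{|S|-1}e^{-t\sum_{j\in S}p_j}
\quad\text{and}\quad
\int\limits_0^\infty e^{-t\sum_{j\in S}p_j}\,dt=\frac{1}{\sum_{j\in S}p_j},$$
and summing the right-hand integrals over the nonempty subsets $S$ with sign $(-1)^{|S|-1}$ yields (2).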
For our case here we have $p_j=\frac{1}{n} \; \forall \; j$.
Substituting in the equation above we get:
$$E(X) = n\sum\limits_{k=1}^{n}(-1)^{k-1} \frac{n \choose k}{k}$$
From here we get as before:
$$E(X) = n\sum\limits_{k=1}^n \frac{1}{k}$$
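The "as before" step uses the classical identity $\sum_{k=1}^n \frac{(-1)^{k-1}}{k}{n\choose k} = H_n$; one way to see it (a standard integral representation, not the argument from the link):
$$\sum_{k=1}^{n}\frac{(-1)^{k-1}}{k}{n\choose k}=\int_0^1\frac{1-(1-x)^n}{x}\,dx=\int_0^1\sum_{j=0}^{n-1}(1-x)^j\,dx=\sum_{j=1}^{n}\frac{1}{j}.$$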
Further, Ross shows that $E(N)=E(X)$ using the law of total expectation.
First, he notes that since $X=\sum_{i=1}^{N}T_i$, where the $T_i$ are the inter-arrival times of the coupon arrivals,
$$E(X|N=k)=kE(T_i).$$
Since the $T_i$ are assumed to be exponential with rate 1,
$$E(X|N)=N\tag{3}$$
Taking expectations on both sides and using the law of total expectation we get:
$$E(X)=E(N)$$
This approach can easily be extended to find $V(N)$, the variance (not covered by Ross). We can use the following expression to get $E(X^2)$:
$$E(X^2) = \int\limits_0^\infty 2tP(X>t)dt = \int\limits_0^\infty 2t\left(1-\prod\limits_{j=1}^n(1-e^{-p_j t})\right)dt$$
Using the fact that $\int\limits_0^\infty te^{-pt}\,dt=\frac{1}{p^2}$ and the same algebra as for $E(X)$ we get:
$$\frac{E(X^2)}{2} = \sum_j \frac {1} {p_j^2} -\sum_{i<j} \frac{1}{(p_i+p_j)^2}+\dots +(-1)^{n-1}\frac{1}{(p_1+\dots+p_n)^2} $$
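Since the ability to get the variance with unequal probabilities is the selling point of A3, here is a small double-precision sketch (illustrative probabilities again) that evaluates both inclusion-exclusion sums over bitmasks and applies the relation $V(N)=E(X^2)-E(N)-E(N)^2$ derived below:

#include <stdio.h>

/* Mean and variance of N for unequal coupon probabilities:
   E(N) = E(X) by (2), E(X^2)/2 by the sum above, and
   V(N) = E(X^2) - E(N) - E(N)^2 (derived below via (3) and (5)). */
int main(void)
{
    double p[] = {0.5, 0.3, 0.2};   /* illustrative probabilities */
    int n = 3;
    double EX = 0.0, EX2half = 0.0;
    for(int mask = 1; mask < (1 << n); mask++){
        double rate = 0.0; int bits = 0;   /* sum of p_j over the subset, |S| */
        for(int j = 0; j < n; j++)
            if(mask & (1 << j)){ rate += p[j]; bits++; }
        double sign = (bits % 2) ? 1.0 : -1.0;
        EX      += sign / rate;            /* term of (2)           */
        EX2half += sign / (rate * rate);   /* term of the sum above */
    }
    double VN = 2.0 * EX2half - EX - EX * EX;
    printf("E(N) = %.6f, V(N) = %.6f\n", EX, VN);
    return 0;
}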
Now, let's consider the special case where all coupons have an equal probability of being selected. In other words, $p_j=\frac 1 n \; \forall \; j$.
We get:
$$\frac{E(X^2)}{2} = n^2\left(\sum\limits_{k=1}^n (-1)^{k-1}\frac{n\choose k}{k^2}\right)$$
Per my answer to the question here, this summation yields:
$$E(X^2) = 2n^2\left( \sum_{j=1}^n\sum_{k=1}^j\frac{1}{jk}\right)\tag{4}$$
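As a quick spot check of (4) at $n=2$ (computed here, not taken from the cited answer):
$$E(X^2)=2\cdot 2^2\left(\frac{1}{1\cdot 1}+\frac{1}{2\cdot 1}+\frac{1}{2\cdot 2}\right)=8\cdot\frac{7}{4}=14,$$
consistent with $\frac{E(X^2)}{2} = n^2\left({2\choose 1}\frac{1}{1^2}-{2\choose 2}\frac{1}{2^2}\right)=4\cdot\frac{7}{4}=7$.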
As a side-note, the binomial identity arising from the calculation of the second moment can be generalized: $$\sum_{k=1}^n(-1)^{k-1}\frac{n\choose k}{k^r}=\sum_{1\le i_1\le i_2\le \dots \le i_r\le n}\frac{1}{i_1 i_2 \dots i_r}$$ See here.
Equation (4) has given us $E(X^2)$ but remember that we're interested in finding $E(N^2)$.
Using the law of total variance we get:
$$V(X)=E(V(X|N))+V(E(X|N))$$
So per equation (3) we have:
$$V(X)=E(V(X|N))+V(N)\tag{5}$$
Now,
$$V(X|N)=NV(T_i)$$
And since $T_i \sim Exp(1)$, we have $V(T_i)=1$, meaning $V(X|N)=N$.
Substituting into (5),
$$V(X)=E(N)+V(N)$$
and since $V(X)=E(X^2)-E(X)^2=E(X^2)-E(N)^2$, it follows that
$$V(N)=E(X^2)-E(N)-E(N)^2$$
where equation (4) gives $E(X^2)$ while $E(N)=n\sum_{k=1}^n \frac{1}{k}$, as shown multiple times on this page. This is consistent with equation (5) of A2.
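To close the loop, a minimal sketch for the equal-probability case that evaluates $V(N)=E(X^2)-E(N)-E(N)^2$ from equation (4) together with $E(N)=nH_n$ (for $n=2$ it should print $V(N)=2$, the variance of $1$ plus a geometric variable with success probability $\frac12$):

#include <stdio.h>

/* Variance of the coupon collector's N (equal probabilities 1/n),
   via V(N) = E(X^2) - E(N) - E(N)^2 with E(X^2) from equation (4)
   and E(N) = n * H_n. */
int main(void)
{
    for(int n = 1; n <= 10; n++){
        double Hn = 0.0, dbl = 0.0;   /* H_n and sum_{j<=n} sum_{k<=j} 1/(jk) */
        for(int j = 1; j <= n; j++){
            Hn += 1.0/j;
            for(int k = 1; k <= j; k++) dbl += 1.0/((double)j*k);
        }
        double EN  = n * Hn;              /* E(N) = n H_n          */
        double EX2 = 2.0 * n * n * dbl;   /* equation (4)          */
        double VN  = EX2 - EN - EN*EN;    /* law of total variance */
        printf("n = %2d: E(N) = %8.4f, V(N) = %9.4f\n", n, EN, VN);
    }
    return 0;
}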