What follows is a computational contribution where we derive a closed
form (as opposed to an infinite series) for the expected number of
draws required to see all coupons at least twice, given that $n'$ of
the $n$ coupon types, with $n' < n$, have already been collected
twice. We then observe that this expectation does not appear to
simplify further. It seems like a rewarding challenge to compute the
asymptotics of these expectations by probabilistic methods and
compare them to the closed form presented below.
Using the notation from this MSE
link we have from
first principles that
$$P[T = m] = \frac{1}{n^m}\times {n-n'\choose 1}\times
(m-1)! [z^{m-1}] \exp(n'z)
\left(\exp(z) - 1 - z\right)^{n-n'-1}
\frac{z}{1}.$$
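As a quick sanity check (an added remark, not part of the original derivation), take $n' = n-1$, so that exactly one coupon type has no copies yet. The formula collapses to
$$P[T = m] = \frac{1}{n^m}\, (m-1)!\, [z^{m-2}] \exp((n-1)z)
= \frac{(m-1)\,(n-1)^{m-2}}{n^m},$$
which is precisely the probability that the first copy of the missing coupon appears somewhere among the first $m-1$ draws, its second copy appears at draw $m$, and the remaining $m-2$ draws fall on the other $n-1$ types.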
We verify that this is a probability distribution. We get
$$\sum_{m\ge 2} P[T=m]
\\ = (n-n') \sum_{m\ge 2} \frac{1}{n^m}
(m-1)! [z^{m-2}] \exp(n'z)
\left(\exp(z) - 1 - z\right)^{n-n'-1}
\\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m}
(m+1)! [z^{m}] \exp(n'z)
\left(\exp(z) - 1 - z\right)^{n-n'-1}
\\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m}
(m+1)! [z^{m}] \exp(n'z)
\\ \times \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} \exp((n-n'-1-p)z)
(-1)^{p} (1+z)^p
\\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m}
(m+1)!
\\ \times [z^{m}]
\sum_{p=0}^{n-n'-1} {n-n'-1\choose p} \exp((n-1-p)z)
(-1)^{p} (1+z)^p
\\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m}
(m+1)!
\\ \times
\sum_{p=0}^{n-n'-1} {n-n'-1\choose p}
\sum_{q=0}^{m}
[z^{m-q}] \exp((n-1-p)z)
(-1)^{p} [z^q] (1+z)^p
\\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m}
(m+1)!
\\ \times
\sum_{p=0}^{n-n'-1} {n-n'-1\choose p}
\sum_{q=0}^{m}
\frac{(n-1-p)^{m-q}}{(m-q)!}
(-1)^{p} {p\choose q}.$$
Re-arranging the order of the sums now yields
$$(n-n') \frac{1}{n^2} \sum_{p=0}^{n-n'-1} {n-n'-1\choose p}
\\ \times \sum_{m\ge 0} \frac{1}{n^m} (m+1)!
\sum_{q=0}^{m}
\frac{(n-1-p)^{m-q}}{(m-q)!}
(-1)^{p} {p\choose q}
\\ = (n-n') \frac{1}{n^2} \sum_{p=0}^{n-n'-1} {n-n'-1\choose p}
\\ \times
\sum_{q\ge 0} (-1)^{p} {p\choose q}
\sum_{m\ge q} \frac{1}{n^m} (m+1)!
\frac{(n-1-p)^{m-q}}{(m-q)!}.$$
Simplifying the inner sum we get
$$\frac{1}{n^q} \sum_{m\ge 0} \frac{1}{n^m} (m+q+1)!
\frac{(n-1-p)^{m}}{m!}
\\ = \frac{(q+1)!}{n^q} \sum_{m\ge 0} \frac{1}{n^m}
{m+q+1\choose q+1} (n-1-p)^m
\\ = \frac{(q+1)!}{n^q} \frac{1}{(1-(n-1-p)/n)^{q+2}}
= (q+1)! n^2 \frac{1}{(p+1)^{q+2}}.$$
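(For $q=0$ the summation step used here is just the familiar $\sum_{m\ge 0} (m+1) x^m = 1/(1-x)^2$ with $x = (n-1-p)/n$.)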
We thus obtain for the sum of the probabilities
$$\sum_{m\ge 2} P[T=m] =
(n-n') \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} (-1)^{p}
\sum_{q=0}^p {p\choose q}
(q+1)! \frac{1}{(p+1)^{q+2}}.$$
Repeating the same computation for $\sum_{m\ge 2} m\, P[T=m]$ (so that the
$(m+1)!$ in the sum becomes $(m+2)!$ throughout) we instantly obtain for the expectation
$$\bbox[5px,border:2px solid #00A000]{
E[T] = n (n-n') \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} (-1)^{p}
\sum_{q=0}^p {p\choose q}
\frac{(q+2)!}{(p+1)^{q+3}}.}$$
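As a sanity check on the boxed formula (again an added remark): for $n' = n-1$ only the term $p=0$, $q=0$ survives, giving
$$E[T] = n \cdot 1 \cdot \frac{2!}{1^3} = 2n,$$
which is indeed the expected waiting time for two copies of the single missing coupon type, each arriving with probability $1/n$ per draw.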
Now, to simplify these, we start with the inner sum from the probability,
using the fact that
$$\sum_{q=0}^p {p\choose q} (q+1)! \frac{1}{(p+1)^{q+1}} = 1$$
which was proved by residues at the cited link from the introduction.
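(For instance, at $p=2$ this reads $\binom{2}{0}\frac{1!}{3} + \binom{2}{1}\frac{2!}{9} + \binom{2}{2}\frac{3!}{27} = \frac13 + \frac49 + \frac29 = 1$.)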
We then obtain
$$(n-n') \sum_{p=0}^{n-n'-1} {n-n'-1\choose p}
\frac{(-1)^{p}}{p+1}
\\ = \sum_{p=0}^{n-n'-1} {n-n'\choose p+1} (-1)^p
= - \sum_{p=1}^{n-n'} {n-n'\choose p} (-1)^p
\\ = 1 - \sum_{p=0}^{n-n'} {n-n'\choose p} (-1)^p
= 1 - (1-1)^{n-n'} = 1$$
which confirms that we indeed have a probability distribution. We will not
attempt this manipulation on the expectation, since actual computation
of the values indicates that it does not simplify, as announced
earlier. For example, these are the expectations for the
pairs $(2n', n')$:
$$4,11,{\frac {347}{18}},{\frac {12259}{432}},
{\frac {41129339}{1080000}},{\frac {390968681}{8100000}},
{\frac {336486120012803}{5717741400000}}, \ldots$$
and for pairs $(3n', n'):$
$${\frac {33}{4}},{\frac {12259}{576}},{\frac {390968681}{10800000}},
{\frac {2859481756726972261}{54646360473600000}}, \ldots$$
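The boxed closed form itself can also be evaluated directly. The following minimal sketch in C (not part of the original verification program; it uses plain double precision, which is adequate for small $n$ but will lose accuracy for large $n$ because of the alternating sum) computes the boxed expression and reproduces, for instance, $11$ for $(n, n') = (4,2)$ and $347/18 \approx 19.2778$ for $(6,3)$.

#include <stdio.h>
#include <stdlib.h>

/* Evaluate the boxed closed form
   E[T] = n (n-n') sum_{p=0}^{n-n'-1} C(n-n'-1,p) (-1)^p
                   sum_{q=0}^{p} C(p,q) (q+2)! / (p+1)^(q+3)
   in double precision. */
static double binom(int a, int b)
{
    double r = 1.0;
    for(int k = 0; k < b; k++)
        r = r * (a - k) / (k + 1);
    return r;
}

int main(int argc, char **argv)
{
    int n  = (argc > 1) ? atoi(argv[1]) : 6;
    int np = (argc > 2) ? atoi(argv[2]) : 3;
    double e = 0.0;
    for(int p = 0; p <= n - np - 1; p++){
        double inner = 0.0;
        double fact = 2.0;                          /* (q+2)! at q = 0 */
        double pw = (double)(p+1)*(p+1)*(p+1);      /* (p+1)^(q+3) at q = 0 */
        for(int q = 0; q <= p; q++){
            inner += binom(p, q) * fact / pw;
            fact *= q + 3;                          /* advance to (q+3)! */
            pw   *= p + 1;                          /* advance to (p+1)^(q+4) */
        }
        e += ((p & 1) ? -1.0 : 1.0) * binom(n - np - 1, p) * inner;
    }
    e *= (double)n * (n - np);
    printf("E[T](n = %d, np = %d) = %.10f\n", n, np, e);
    return 0;
}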
The reader seeking numerical evidence confirming the closed form, or
additional clarification of the problem definition that was used, may
consult the following simple C program, whose output matched the
formula in all cases that were examined.
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <time.h>

int main(int argc, char **argv)
{
    /* defaults: n types, np of them already complete, j copies required per type */
    int n = 6, np = 3, j = 3, trials = 1000;

    if(argc >= 2){
        n = atoi(argv[1]);
    }
    if(argc >= 3){
        np = atoi(argv[2]);
    }
    if(argc >= 4){
        j = atoi(argv[3]);
    }
    if(argc >= 5){
        trials = atoi(argv[4]);
    }

    assert(1 <= n);
    assert(1 <= np && np < n);
    assert(1 <= j);
    assert(1 <= trials);

    srand48(time(NULL));

    long long data = 0;
    for(int tind = 0; tind < trials; tind++){
        int seen = np; int steps = 0;
        int dist[n];

        /* the first np types start with j copies, the rest with none */
        for(int cind = 0; cind < n; cind++){
            if(cind < np)
                dist[cind] = j;
            else
                dist[cind] = 0;
        }

        /* draw uniformly until every type has been seen at least j times */
        while(seen < n){
            int coupon = drand48() * (double)n;
            steps++;
            if(dist[coupon] == j-1)
                seen++;
            dist[coupon]++;
        }
        data += steps;
    }

    long double expt = (long double)data/(long double)trials;
    printf("[n = %d, np = %d, j = %d, trials = %d]: %Le\n",
           n, np, j, trials, expt);
    exit(0);
}
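To try it (assuming a POSIX system providing drand48 and a file name of, say, coupon.c):

cc -O2 coupon.c -o coupon
./coupon 6 3 2 1000000

With $j=2$ the reported estimate should be close to the value $347/18 \approx 19.28$ for $(n, n') = (6,3)$ from the table above.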
By inclusion–exclusion, after $n$ time steps collector $i$ has not finished with probability
$$
2\left(1-\frac15{p_i}\right)^n-\left(1-\frac25{p_i}\right)^n\;.
$$
Thus, all collectors have finished with probability
$$
\mathsf P(N\le n)=\prod_{i=1}^C\left(1-2\left(1-\frac15{p_i}\right)^n+\left(1-\frac25{p_i}\right)^n\right)\;,
$$
so the desired expectation of the number $N$ of time steps required is
\begin{eqnarray}
\mathsf E[N]
&=&
\sum_{n=0}^\infty\mathsf P(N\gt n)
\\
&=&
\sum_{n=0}^\infty\left(1-\prod_{i=1}^C\left(1-2\left(1-\frac15{p_i}\right)^n+\left(1-\frac25{p_i}\right)^n\right)\right)\;.
\end{eqnarray}
It’s not very enlightening to multiply this out for general $C$, but for concrete values of $C$ and $p_i$ you can multiply it out and sum all the resulting geometric series.
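To illustrate that last remark, here is a minimal sketch in C (the values of $C$ and the $p_i$ below are made up for the example, and the truncation threshold is arbitrary) that sums the series for $\mathsf E[N]$ numerically, stopping once the summand $\mathsf P(N>n)$ is negligible.

#include <stdio.h>

int main(void)
{
    /* E[N] = sum_{n>=0} P(N > n) with
       P(N > n) = 1 - prod_i (1 - 2(1 - p_i/5)^n + (1 - 2 p_i/5)^n);
       C and the p_i below are illustrative values only. */
    const int C = 3;
    const double p[] = {0.5, 0.3, 0.2};
    double a[3], b[3];      /* running powers (1 - p_i/5)^n and (1 - 2 p_i/5)^n */
    for(int i = 0; i < C; i++){ a[i] = 1.0; b[i] = 1.0; }

    double e = 0.0;
    for(int n = 0; n < 1000000; n++){
        double prod = 1.0;
        for(int i = 0; i < C; i++)
            prod *= 1.0 - 2.0*a[i] + b[i];
        double tail = 1.0 - prod;      /* P(N > n) */
        e += tail;
        if(tail < 1e-14) break;
        for(int i = 0; i < C; i++){
            a[i] *= 1.0 - p[i]/5.0;
            b[i] *= 1.0 - 2.0*p[i]/5.0;
        }
    }
    printf("E[N] ~= %.6f\n", e);
    return 0;
}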
Best Answer
I eventually figured out this one. Every result in the question above is correct. It's just that the $X$ in equation (1) is the time at which all coupons will have been collected if we assume that coupons arrive at rate $\lambda=1$ according to a Poisson process, with each arriving coupon being of type $j$ with probability $p_j$. Let $N$ be the number of coupons collected when the collection is completed. Then we're interested in $E(N^2)$, and that is what expression (2) in the question computes. So we need to relate $E(X^2)$ to $E(N^2)$. First, as Ross notes,
$$E(X|N=n)=nE(T_i)$$
where the $T_i$ are the inter-arrival times between coupon arrivals. Since these are assumed to be exponential with rate $1$,
$$E(X|N)=N\tag{1}$$
Taking expectations on both sides and using the law of total expectation we get:
$$E(X)=E(N)$$
Now, what about variance? Using the law of total variance we get:
$$V(X)=E(V(X|N))+V(E(X|N))$$
So per equation (1) we have:
$$V(X)=E(V(X|N))+V(N)\tag{2}$$
Now,
$$V(X|N)=NV(T_i)$$
And since $T_i \sim Exp(1)$, we have $V(T_i)=1$ meaning, $V(X|N)=N$.
Substituting into (2),
$$V(X)=E(N)+V(N)$$
And this extra $E(N)$ term accounts for the missing term in the question.
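A quick Monte Carlo check of the identity $V(X)=E(N)+V(N)$ is easy to set up. The sketch below (an illustration only: it uses an ordinary coupon collector with $n=5$ equally likely types, each needed once) simulates $N$ as the number of draws to completion and $X$ as the sum of $N$ independent $\mathrm{Exp}(1)$ inter-arrival times, then compares the sample moments.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

int main(void)
{
    /* ordinary coupon collector: n equally likely types, one copy of each needed */
    const int n = 5, trials = 200000;
    srand48(time(NULL));
    double sn = 0, sn2 = 0, sx = 0, sx2 = 0;
    for(int t = 0; t < trials; t++){
        int seen[5] = {0}, got = 0;
        long N = 0;
        double X = 0.0;
        while(got < n){
            int c = (int)(drand48() * n);
            if(!seen[c]){ seen[c] = 1; got++; }
            N++;
            X += -log(1.0 - drand48());   /* Exp(1) inter-arrival time */
        }
        sn += N; sn2 += (double)N * N;
        sx += X; sx2 += X * X;
    }
    double EN = sn / trials, VN = sn2 / trials - EN * EN;
    double VX = sx2 / trials - (sx / trials) * (sx / trials);
    printf("E[N] = %.4f  V(N) = %.4f  V(X) = %.4f  E[N]+V(N) = %.4f\n",
           EN, VN, VX, EN + VN);
    return 0;
}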