[Math] Coupon Collector Problem with multiple copies and X amount of coupons already collected

coupon-collector, probability

I have a variant of the coupon collector problem where there are $n$ different coupons being collected, drawn with equal probability and with replacement, but 2 copies of each coupon are needed. Now say I have collected a total of $X$ relevant coupons out of the $2n$ coupons needed for a full set, and I want to know how many extra, unneeded coupons I should expect to have drawn. What would be an equation for that?
I have found on Wikipedia an equation for the expected number of draws needed to get a full set of one of each coupon, and one for the number of draws needed to get a full set of multiple copies of each coupon.

$E(T) = n\log{n} + \gamma n + \frac{1}{2} + O\left(\frac{1}{n}\right)$, where $\gamma \approx 0.5772156649$

$E(T_m) = n\log{n} + (m-1)n\log\log{n} + O(n)$, as $n \to \infty$

Where $m$ is the number of copies of each coupon to be collected (so 2 in this case), and $T_m$ is the first time $m$ copies of every coupon have been collected.
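
(For scale, taking $\log$ as the natural logarithm and dropping the unspecified $O(n)$ term: with $n = 100$ and $m = 2$ this gives roughly $100\ln 100 + 100\ln\ln 100 \approx 460.5 + 152.7 \approx 613$ draws, compared with about $519$ for one copy of each from the first formula.)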

I also found the following from a previous question, Coupon Collector's Problem with X amount of coupons already collected:

After $i-1$ distinct coupons have been collected, the probability $p_i$ of selecting a new coupon equals $\frac{n-i+1}{n}$, and the expected number of draws needed to draw a new coupon equals $\frac{1}{p_i} = \frac{n}{n-i+1}$. As such, the expected value for the time needed to draw all $n$ coupons can be calculated as:

$$E[T] = \frac{n}{n} + \frac{n}{n-1} + \ldots + \frac{n}{1} = n \sum_{k=1}^{n}{\frac{1}{k}}$$

In this case, however, we have already drawn $X$ unique coupons. As such, the expected number of draws needed to find all $n$ coupons equals:

$$E[T] = E[t_{X+1}] + E[t_{X+2}] + \ldots + E[t_n] = n \sum_{k=1}^{n-X} \frac{1}{k}$$

So the expected total number of coupons drawn upon collecting the $X^{\text{th}}$ unique coupon would be
$$n \sum_{k=1}^{n}{\frac{1}{k}}-n \sum_{k=1}^{n-X} \frac{1}{k} $$

The expected number of unneeded duplicate coupons would then be
$$n \sum_{k=1}^{n}{\frac{1}{k}} - n \sum_{k=1}^{n-X} \frac{1}{k} - X $$
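
For concreteness, here is a minimal C sketch of these two sums (the values $n = 100$ and $X = 60$ are only placeholders); the same harmonic sums are straightforward to set up in a spreadsheet as well.

#include <stdio.h>

/* H(n) = 1 + 1/2 + ... + 1/n */
double harmonic(int n)
{
  double h = 0.0;
  for(int k = 1; k <= n; k++)
    h += 1.0 / k;
  return h;
}

int main(void)
{
  int n = 100;   /* number of coupon types (illustrative) */
  int X = 60;    /* unique coupons already collected */

  double drawn = n * harmonic(n) - n * harmonic(n - X);  /* expected total draws so far */
  double extra = drawn - X;                              /* expected unneeded duplicates */

  printf("expected draws: %.3f  expected duplicates: %.3f\n", drawn, extra);
  return 0;
}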

I'm not sure how to combine these two equations, $E(T_m) = n\log{n} + (m-1)n\log\log{n} + O(n)$ as $n\to\infty$, and $n \sum_{k=1}^{n}{\frac{1}{k}} - n \sum_{k=1}^{n-X} \frac{1}{k}$, to get the expected total number of coupons drawn once $X$ relevant coupons have been collected towards a full collection of 2 of each coupon. With that equation I would then just subtract $X$ (the number of relevant coupons collected) from the total to get the number of unneeded coupons. I will admit I'm not in a math profession and haven't done any higher-level math lately, but if I understand correctly the first equation is more of an approximation, while the second is exact, so I'm not sure how to combine them or whether they can be easily combined.

Also, while I somewhat understand $O(n)$ and its use, I'm not sure how to enter it along with the rest of the equation into Wolfram Alpha or even Excel, which is where I ultimately want to use this equation. The maximum number of coupons would be about 100, if that helps. If it would be easier, the number of coupons with only 1 copy collected and the number with 2 copies collected could be used as inputs instead of the total number of relevant coupons collected.

Best Answer

What follows is a computational contribution where we derive a closed form (as opposed to an infinite series) for the expected number of draws required to see all coupons at least twice, when $n'$ of the $n$ coupon types, with $n' < n$, have already been collected in two instances. We then observe that the expectation does not simplify. It seems like a rewarding challenge to compute the asymptotics of these expectations by probabilistic methods and compare them to the closed form presented below.

Using the notation from this MSE link we have from first principles that

$$P[T = m] = \frac{1}{n^m}\times {n-n'\choose 1}\times (m-1)! [z^{m-1}] \exp(n'z) \left(\exp(z) - 1 - z\right)^{n-n'-1} \frac{z^1}{1!}.$$

Here the factor ${n-n'\choose 1}$ chooses the coupon type that completes the collection on draw $m$; that type must have appeared exactly once among the first $m-1$ draws (the factor $\frac{z^1}{1!}$), the $n'$ types that are already complete may appear any number of times ($\exp(n'z)$), and each of the remaining $n-n'-1$ types must appear at least twice ($\exp(z)-1-z$). We verify that this is a probability distribution. We get

$$\sum_{m\ge 2} P[T=m] \\ = (n-n') \sum_{m\ge 2} \frac{1}{n^m} (m-1)! [z^{m-2}] \exp(n'z) \left(\exp(z) - 1 - z\right)^{n-n'-1} \\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m} (m+1)! [z^{m}] \exp(n'z) \left(\exp(z) - 1 - z\right)^{n-n'-1} \\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m} (m+1)! [z^{m}] \exp(n'z) \\ \times \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} \exp((n-n'-1-p)z) (-1)^{p} (1+z)^p \\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m} (m+1)! \\ \times [z^{m}] \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} \exp((n-1-p)z) (-1)^{p} (1+z)^p \\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m} (m+1)! \\ \times \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} \sum_{q=0}^{m} [z^{m-q}] \exp((n-1-p)z) (-1)^{p} [z^q] (1+z)^p \\ = (n-n') \frac{1}{n^2} \sum_{m\ge 0} \frac{1}{n^m} (m+1)! \\ \times \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} \sum_{q=0}^{m} \frac{(n-1-p)^{m-q}}{(m-q)!} (-1)^{p} {p\choose q}.$$

Re-arranging the order of the sums now yields

$$(n-n') \frac{1}{n^2} \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} \\ \times \sum_{m\ge 0} \frac{1}{n^m} (m+1)! \sum_{q=0}^{m} \frac{(n-1-p)^{m-q}}{(m-q)!} (-1)^{p} {p\choose q} \\ = (n-n') \frac{1}{n^2} \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} \\ \times \sum_{q\ge 0} (-1)^{p} {p\choose q} \sum_{m\ge q} \frac{1}{n^m} (m+1)! \frac{(n-1-p)^{m-q}}{(m-q)!}.$$

Simplifying the inner sum we get

$$\frac{1}{n^q} \sum_{m\ge 0} \frac{1}{n^m} (m+q+1)! \frac{(n-1-p)^{m}}{m!} \\ = \frac{(q+1)!}{n^q} \sum_{m\ge 0} \frac{1}{n^m} {m+q+1\choose q+1} (n-1-p)^m \\ = \frac{(q+1)!}{n^q} \frac{1}{(1-(n-1-p)/n)^{q+2}} = (q+1)! n^2 \frac{1}{(p+1)^{q+2}}.$$

We thus obtain for the sum of the probabilities

$$\sum_{m\ge 2} P[T=m] = (n-n') \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} (-1)^{p} \sum_{q=0}^p {p\choose q} (q+1)! \frac{1}{(p+1)^{q+2}}.$$

Repeating the same computation with an extra factor of $m$ (since $E[T] = \sum_{m\ge 2} m\, P[T=m]$) instantly yields the expectation

$$\bbox[5px,border:2px solid #00A000]{ E[T] = n (n-n') \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} (-1)^{p} \sum_{q=0}^p {p\choose q} \frac{(q+2)!}{(p+1)^{q+3}}.}$$
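
For instance, with $n = 2$ and $n' = 1$ the outer sum reduces to the single term $p = 0$, whose inner sum is ${0\choose 0}\frac{2!}{1^3} = 2$, so that $E[T] = 2\cdot 1\cdot 2 = 4$, which is the first value in the list further down.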

Now to simplify these we start with the inner sum from the probability, using the fact that

$$\sum_{q=0}^p {p\choose q} (q+1)! \frac{1}{(p+1)^{q+1}} = 1$$

which was proved by residues at the cited link from the introduction. We then obtain

$$(n-n') \sum_{p=0}^{n-n'-1} {n-n'-1\choose p} \frac{(-1)^{p}}{p+1} \\ = \sum_{p=0}^{n-n'-1} {n-n'\choose p+1} (-1)^p = - \sum_{p=1}^{n-n'} {n-n'\choose p} (-1)^p \\ = 1 - \sum_{p=0}^{n-n'} {n-n'\choose p} (-1)^p = 1 - (1-1)^{n-n'} = 1$$

which confirms that we indeed have a probability distribution. We will not attempt this manipulation with the expectation, since actual computation of the values indicates that it does not simplify, as announced earlier. For example, these are the expectations for the pairs $(n, n') = (2n', n')$:

$$4,11,{\frac {347}{18}},{\frac {12259}{432}}, {\frac {41129339}{1080000}},{\frac {390968681}{8100000}}, {\frac {336486120012803}{5717741400000}}, \ldots$$

and for the pairs $(n, n') = (3n', n')$:

$${\frac {33}{4}},{\frac {12259}{576}},{\frac {390968681}{10800000}}, {\frac {2859481756726972261}{54646360473600000}}, \ldots$$
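
Here is a short C routine that evaluates the boxed closed form directly in floating point, offered as a cross-check; it reproduces the values above, e.g. $19.277778 \approx 347/18$ for $(n, n') = (6, 3)$ and $28.377315 \approx 12259/432$ for $(8, 4)$.

#include <stdio.h>
#include <math.h>

/* binomial coefficient in double precision */
double binom(int n, int k)
{
  double b = 1.0;
  for(int i = 0; i < k; i++)
    b = b * (n - i) / (i + 1);
  return b;
}

/* the boxed closed form for E[T]: n coupon types, np of them already complete */
double closed_form(int n, int np)
{
  double outer = 0.0;
  for(int p = 0; p <= n - np - 1; p++){
    double inner = 0.0, fact = 2.0;          /* fact = (q+2)!, starting at q = 0 */
    for(int q = 0; q <= p; q++){
      inner += binom(p, q) * fact / pow(p + 1, q + 3);
      fact *= q + 3;                         /* advance (q+2)! -> (q+3)! */
    }
    outer += (p % 2 ? -1.0 : 1.0) * binom(n - np - 1, p) * inner;
  }
  return n * (n - np) * outer;
}

int main(void)
{
  for(int np = 1; np <= 4; np++)             /* the pairs (2n', n') */
    printf("n = %d, np = %d: %.6f\n", 2 * np, np, closed_form(2 * np, np));
  return 0;
}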

The reader who seeks numerical evidence confirming the closed form, or additional clarification of the problem definition being used, may consult the following simple C program, whose output matched the formula in all cases that were examined.

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <time.h>

int main(int argc, char **argv)
{
  int n = 6, np = 3, j = 3, trials = 1000;   /* defaults: n coupon types, np already complete, j copies required, number of trials */

  if(argc >= 2){
    n = atoi(argv[1]);
  }

  if(argc >= 3){
    np = atoi(argv[2]);
  }

  if(argc >= 4){
    j = atoi(argv[3]);
  }

  if(argc >= 5){
    trials = atoi(argv[4]);
  }

  assert(1 <= n);
  assert(1 <= np && np < n);  
  assert(1 <= j);
  assert(1 <= trials);

  srand48(time(NULL));       /* seed the PRNG */
  long long data = 0;        /* total draws accumulated over all trials */

  for(int tind = 0; tind < trials; tind++){
    int seen = np; int steps = 0;    /* np types start out complete */
    int dist[n];                     /* copies collected of each type */

    for(int cind = 0; cind < n; cind++){
      if(cind < np)
        dist[cind] = j;              /* already-complete types hold j copies */
      else
        dist[cind] = 0;
    }

    while(seen < n){
      int coupon = drand48() * (double)n;   /* uniform draw from {0, ..., n-1} */

      steps++;

      if(dist[coupon] == j-1)               /* this draw completes a type */
        seen++;
      dist[coupon]++;
    }

    data += steps;
  }

  long double expt = (long double)data/(long double)trials;
  printf("[n = %d, np = %d, j = %d, trials = %d]: %Le\n", 
         n, np, j, trials, expt);

  exit(0);
}
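
For reference, the program takes up to four optional command-line arguments in the order n, np, j, trials. Compiled on a typical Linux system with, say, cc -O2 -o coupon coupon.c (the drand48/srand48 calls may need -D_XOPEN_SOURCE=500 under strict standard modes), a run such as ./coupon 6 3 2 1000000 uses j = 2 copies as in the question, and the empirical mean should approach 347/18 ≈ 19.28 from the list above.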