How to solve a generalization of the Coupon Collector's problem

pr.probability

The coupon collector's problem is a classical problem in probability theory, stated as follows (from Wikipedia):

Suppose that there are $n$ coupons, from which coupons are being collected with replacement. What is the probability that more than $t$ sample trials are needed to collect all $n$ coupons?

A generalization of this problem was proposed by Newman & Shepp, requiring that $k$ samples of each coupon be collected. The answer to this generalization is known.

I, however, need to calculate the answer to an even further generalization, which is:

How many sample trials are needed to collect at least $k$ copies each of $m$ different coupon types?

Any help or a point in the right direction would be greatly appreciated.

(Edit #3): Removed the misleading example.

(Edit #2): This problem can be stated simply as a balls-and-bins problem. If we have $n$ bins, how many balls need to be thrown so that at least $m$ bins have at least $k$ balls?
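For concreteness, here is a minimal Monte Carlo sketch of this balls-and-bins formulation (Python; the parameters at the bottom are illustrative, not taken from the problem above):

```python
import numpy as np

rng = np.random.default_rng(0)

def draws_until_m_bins_have_k(n, m, k, rng):
    """Throw balls uniformly at random into n bins until at least m bins
    contain at least k balls; return the number of throws needed."""
    counts = np.zeros(n, dtype=int)
    full = 0                      # bins that have reached k balls so far
    throws = 0
    while full < m:
        b = rng.integers(n)       # uniform bin in {0, ..., n-1}
        counts[b] += 1
        throws += 1
        if counts[b] == k:        # bin b just reached k for the first time
            full += 1
    return throws

# Illustrative parameters: n = 50 bins, m = 20 of them must reach k = 3 balls.
samples = [draws_until_m_bins_have_k(50, 20, 3, rng) for _ in range(2000)]
print(np.mean(samples))           # Monte Carlo estimate of the expectation
```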

Best Answer

If you want the expected value, one answer is $n E[S_{(m)}]$, where $S_{(m)}$ is the $m$th order statistic of a sample of $n$ gamma$(k,1)$ random variables. While this expression may not have a simple closed form, you may be able to get a good approximation from the literature on moments of order statistics. (Edit: This appears to be the case, even when comparing with the known asymptotic expression for the case $m=n$. See the discussion at the end.)

Here's the argument for $n E[S_{(m)}]$: Take a Poisson process $P$ with rate 1 and interarrival times $Z_1, Z_2, \ldots$. Let each event in the process $P$ have probability $1/n$ of being the first kind of coupon, probability $1/n$ of being the second kind, and so forth. By the decomposition (thinning) property of Poisson processes, we can then model the arrivals of coupon type $i$ as a Poisson process $P_i$ with rate $1/n$, and the $P_i$'s are independent. Denote by $T_i$ the time until process $P_i$ has produced $k$ coupons. Then $T_i$ has a gamma$(k,1/n)$ distribution (shape $k$, rate $1/n$). The waiting time until $m$ of the processes have each produced $k$ coupons is the $m$th order statistic $T_{(m)}$ of the iid random variables $T_1, T_2, \ldots, T_n$. Let $N_m$ denote the total number of events across all the processes at time $T_{(m)}$; this $N_m$ is exactly the random variable the OP is interested in. We have

$$T_{(m)} = \sum_{r=1}^{N_m} Z_r.$$

Since $N_m$ is determined by the coupon-type marks alone, which are independent of the interarrival times, $N_m$ is independent of the $Z_r$'s. Because the $Z_r$ are iid exponential(1) with mean $1$, we have

$$E[T_{(m)}] = E\left[E\left[\sum_{r=1}^{N_m} Z_r \,\bigg|\, N_m \right] \right] = E\left[N_m E[Z_1] \right] = E\left[N_m \right].$$

By scaling properties of the gamma distribution, $T_i = n S_i$, where $S_i$ has a gamma$(k,1)$ distribution. Thus $T_{(m)} = n S_{(m)}$, and so $E\left[N_m \right] = n E[S_{(m)}]$.
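As a quick numerical sanity check of this identity, one can estimate $n E[S_{(m)}]$ by sampling the order statistic directly, using the same illustrative parameters as the balls-and-bins sketch in the question:

```python
import numpy as np

rng = np.random.default_rng(0)

def n_times_expected_order_statistic(n, m, k, reps, rng):
    """Estimate n * E[S_(m)], where S_(m) is the m-th order statistic
    of n iid gamma(k, 1) random variables."""
    s = rng.gamma(shape=k, size=(reps, n))          # reps rows of n gammas
    s_m = np.partition(s, m - 1, axis=1)[:, m - 1]  # m-th smallest per row
    return n * s_m.mean()

# Should agree (up to Monte Carlo error) with the direct simulation above.
print(n_times_expected_order_statistic(50, 20, 3, 20_000, rng))
```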

For more on this idea, see Lars Holst's paper "On the birthday, collectors', occupancy, and other classical urn problems," International Statistical Review 54(1) (1986), 15-27.

(ADDED: Looked up literature on moments of order statistics.)

David and Nagaraja's text Order Statistics (pp. 91-92) implies the bounds $$n P^{-1}\left(k,\frac{m-1}{n}\right) \leq n E[S_{(m)}] \leq n P^{-1}\left(k,\frac{m}{n}\right),$$ where $P(k,x)$ is the regularized lower incomplete gamma function and $P^{-1}(k,y)$ denotes its inverse in the second argument.

Some software programs can invert $P$ for you numerically. Trying a few examples, it appears that the bounds given by David and Nagaraja can be quite tight. For example, taking $n$ = 100,000, $m$ = 50,000, and $k$ = 25,000, the two bounds give estimates (via Mathematica) around $2.5 \times 10^9$, and the difference between the two estimates is about 400. More extreme values for $k$ and $m$ give results that are not as good, but even values as extreme as $m$ = 10, $k$ = 4 with $n$ = 100,000 still yield a relative error of less than 3%. Depending on the precision you need, this might be good enough.
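For example, SciPy exposes the inverse of the regularized lower incomplete gamma function as `scipy.special.gammaincinv`, so a sketch of the bounds above (reproducing the $n$ = 100,000, $m$ = 50,000, $k$ = 25,000 example) might look like:

```python
from scipy.special import gammaincinv  # inverts the *lower* regularized P

def david_nagaraja_bounds(n, m, k):
    """David & Nagaraja bounds on n * E[S_(m)]:
    n * P^{-1}(k, (m-1)/n) <= n * E[S_(m)] <= n * P^{-1}(k, m/n)."""
    return n * gammaincinv(k, (m - 1) / n), n * gammaincinv(k, m / n)

lo, hi = david_nagaraja_bounds(100_000, 50_000, 25_000)
print(lo, hi, hi - lo)   # both near 2.5e9, differing by roughly 400
```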

Moreover, these bounds seem to give better results for $m \approx n$ than using the asymptotic expression for the case $m = n$ given in Flajolet and Sedgewick's Analytic Combinatorics as an estimate. The latter has error $o(n)$ and appears to be stated for fixed $k$. If $k$ is small, the asymptotic estimate is within, or quite close to, the David and Nagaraja bounds. However, for large enough $k$ (say, on the order of $n$) the error in the asymptotic is on the order of the size of the estimate itself, and the asymptotic expression can even produce a negative estimate of the expected value. In contrast, the bounds from the order-statistics approach appear to get tighter when $k$ is on the order of $n$.
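For reference, here is a sketch of that comparison. The $m = n$ asymptotic is written below in its Erdős-Rényi form, $n\left(\ln n + (k-1)\ln\ln n - \ln\big((k-1)!\big) + \gamma\right)$; treat the constant term as an assumption to be checked against Flajolet and Sedgewick's statement before relying on it.

```python
import math
from scipy.special import gammaincinv

EULER_GAMMA = 0.5772156649015329

def asymptotic_all_coupons(n, k):
    """Asymptotic expected draws for m = n (k copies of every type):
    n * (ln n + (k-1) ln ln n - ln((k-1)!) + gamma) + o(n).
    Constant term quoted in Erdos-Renyi form; verify before relying on it."""
    return n * (math.log(n) + (k - 1) * math.log(math.log(n))
                - math.lgamma(k) + EULER_GAMMA)   # lgamma(k) = ln((k-1)!)

n, k = 100_000, 4
# At m = n the upper bound n * P^{-1}(k, m/n) is infinite, so compare the
# asymptotic against the finite lower bound only.
print(asymptotic_all_coupons(n, k), n * gammaincinv(k, (n - 1) / n))
```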

(Caution: There are two versions of the regularized incomplete gamma function: the lower one $P$, which is the one we want, with integration limits from $0$ to $x$; and the upper one $Q$, with limits from $x$ to $\infty$. Some software programs use the upper one.)
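In SciPy, for instance, `gammainc`/`gammaincinv` are the lower pair $P$ and `gammaincc`/`gammainccinv` the upper pair $Q$; a short sanity check of which convention you have:

```python
from scipy.special import gammainc, gammaincinv, gammainccinv

k, x = 25_000, 25_000.0
p = gammainc(k, x)                                   # lower P(k, x)
assert abs(gammaincinv(k, p) - x) < 1e-6 * x         # inverts P, as we want
assert abs(gammainccinv(k, 1.0 - p) - x) < 1e-6 * x  # inverts Q = 1 - P
```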
