Variation of coupon collector problem with distinct coupons in each type

I am interested in yet another variation of the coupon collector problem. Specifically, we have an urn with $n_i$ "distinct" coupons of type $i$, where $i = 1,\ldots, K$ (a type can be thought of as a color, but within each color the coupons are numbered). The goal is to sample with replacement until the collection contains at least $m_i$ distinct coupons of type $i$ for every $i = 1,\ldots, K$ (any $m_i$ of the $n_i$ numbered coupons within a type will do). Note that this is different from the variation in:

N. Shank and H. Yang, "Coupon Collector Problem for Non-Uniform Coupons and Random Quotas," Electronic Journal of Combinatorics 20(2), 2013

since the coupons here are numbered and the analysis needs to ensure the $m_i$ coupons from type $i$ are distinct. So the question is:

For a given number of samples $n$, what is the probability that we achieve the requirement above (i.e., obtain $m_i$ distinct coupons of each type $i = 1,\ldots,K$)? Upper and lower bounds would also be useful. Also, what is the expected number of samples $E[N]$ needed to achieve this requirement?
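
To make the setup concrete, both quantities can be estimated by direct simulation for small instances. Below is a minimal Monte Carlo sketch in Python; all names (`sim_time`, `estimate`, etc.) are illustrative, not taken from any reference.

```python
import random

def sim_time(sizes, quotas, rng=random):
    """Sample with replacement until, for every type i, at least
    quotas[i] distinct coupons of type i have been seen; return
    the number of draws this took."""
    urn = [(i, j) for i, n in enumerate(sizes) for j in range(n)]
    seen = [set() for _ in sizes]
    draws = 0
    while any(len(s) < q for s, q in zip(seen, quotas)):
        i, j = rng.choice(urn)
        seen[i].add(j)
        draws += 1
    return draws

def estimate(sizes, quotas, n, trials=50_000):
    """Estimate P(requirement met within n draws) and E[N]."""
    times = [sim_time(sizes, quotas) for _ in range(trials)]
    return (sum(t <= n for t in times) / trials,
            sum(times) / trials)

# Example: K = 2 types with n_1 = 3, n_2 = 2 numbered coupons
# and quotas m_1 = 2, m_2 = 1.
print(estimate([3, 2], [2, 1], n=5))
```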

Best Answer

I use the same notation as the cited article (so that the results can be compared directly): consider an urn which, for $i\in\{1,\ldots,n\}$, contains $x_i\geq 1$ distinguishable coupons of type $i$, altogether $X_n:=x_1+\ldots+x_n$ coupons.

Coupons are drawn with replacement from this urn until (for each $i$) $m_i$ (where $1\leq m_i\leq x_i$) mutually different coupons of type $i$ have been drawn.

Let $m:=(m_1,\ldots,m_n)$, let $T(m)$ be the random time at which this happens, and let $p_{x_i}(m_i,t):=\sum_{k=0}^{m_i-1}{x_i \choose k}\,t^k$.

Let $Y_1(k),\ldots,Y_n(k)$ be the random variables $Y_i(k):=$ number of different coupons of type $i$ that have been drawn at "time" $k$.

I use generating functions and start from the following basic

Proposition The generating function of (the joint distribution of) $Y_1(k),\ldots,Y_n(k)$ is given by: \begin{equation*} \mathbb{E}\, t_1^{Y_1(k)}\ldots t_n^{Y_n(k)}=\frac{k!} {X_n^k}[t^k] (1+(e^t-1)\,t_1)^{x_1}\cdot\ldots\cdot(1+(e^t-1)\,t_n)^{x_n} \end{equation*}

Proof Let $j_1+\ldots +j_n\leq k$. Clearly $$\mathbb{P}(Y_1(k)=j_1,\ldots,Y_n(k)=j_n)=\frac{1}{X_n^k}\cdot {x_1 \choose j_1}\cdots {x_n \choose j_n}\cdot Sur(k,j_1+\ldots+j_n)$$ where $Sur(k,r)$ denotes the number of surjective mappings from $\{1,\ldots,k\}$ onto $\{1,\ldots,r\}$: the binomial coefficients count the choices of which $j_i$ coupons of type $i$ appear, and the surjection count is the number of draw sequences that hit exactly those chosen coupons. It is known that $Sur(k,r)=k!\,[t^k]\,(e^t-1)^r$ (since such a surjective mapping corresponds uniquely to an ordered partition of $\{1,\ldots,k\}$ into $r$ non-empty subsets, and $e^t-1$ is the exponential generating function for non-empty sets). The assertion about the g.f. follows. End of proof.
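
The probability in the proof is easy to confirm by brute force for small urns. The following sketch (helper names are mine) computes $Sur(k,r)$ via the standard inclusion-exclusion formula $Sur(k,r)=\sum_{i=0}^r(-1)^i\binom{r}{i}(r-i)^k$ and compares the formula against exhaustive enumeration of all $X_n^k$ draw sequences:

```python
from math import comb
from fractions import Fraction
from itertools import product

def sur(k, r):
    """Surjections {1,...,k} -> {1,...,r}, by inclusion-exclusion."""
    return sum((-1)**i * comb(r, i) * (r - i)**k for i in range(r + 1))

def p_joint(x, j, k):
    """P(Y_1(k)=j_1, ..., Y_n(k)=j_n) from the formula in the proof."""
    num = sur(k, sum(j))
    for xi, ji in zip(x, j):
        num *= comb(xi, ji)
    return Fraction(num, sum(x)**k)

def p_joint_brute(x, j, k):
    """The same probability by enumerating every draw sequence."""
    coupons = [(i, c) for i, xi in enumerate(x) for c in range(xi)]
    hits = 0
    for seq in product(coupons, repeat=k):
        counts = [len({c for (t, c) in seq if t == i}) for i in range(len(x))]
        hits += counts == list(j)
    return Fraction(hits, len(coupons)**k)

x, k = (2, 3), 4
assert all(p_joint(x, j, k) == p_joint_brute(x, j, k)
           for j in product(range(3), range(4)))
print("formula agrees with brute force for x =", x, "and k =", k)
```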

(I) The distribution of $T(m)$

Since $\{\,T(m)\leq k\}=\{\,Y_1(k)\geq m_1,\ldots,Y_n(k)\geq m_n\,\}$, the above gives \begin{equation*} \mathbb{P}(T(m)\leq k) =\frac{k!} {X_n^k}[t^k]\, \big(e^{tx_1} -p_{x_1}(m_1,e^t-1)\big)\cdot\ldots\cdot\big(e^{tx_n} -p_{x_n}(m_n,e^t-1)\big) \end{equation*}

From g.f. to probability: we have to compute \begin{align*}\mathbb{P}(\,Y_1(k)\geq m_1,\ldots,Y_n(k)\geq m_n\,) &=\sum_{j_1\geq m_1,\ldots, j_n\geq m_n} \mathbb{P}(Y_1(k)=j_1,\ldots,Y_n(k)=j_n)\\ &=\sum_{j_1\geq m_1,\ldots, j_n\geq m_n} [t_1^{j_1}\ldots t_n^{j_n}]\mathbb{E}\, t_1^{Y_1(k)}\ldots t_n^{Y_n(k)} \end{align*} Since the function after $[t^k]$ is a product of factors, each containing only one of the $t_i$ variables, we can treat these factors individually. For example, to account for $\mathbb{P}(Y_1(k)\geq m_1,\ldots)$ we have to sum all the coefficients $[t_1^j]$ with $j\geq m_1$ of the first factor. Equivalently, we may sum all coefficients (i.e., put $t_1=1$) and subtract the sum of the first $m_1$ coefficients. Doing that "inside", and leaving the $t^k$ extraction "outside" (coefficients may be extracted in arbitrary order), we arrive at the factor $e^{tx_1}-p_{x_1}(m_1,e^t-1)$, and similarly for the other factors.
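
Computationally, expanding each factor as $e^{tx_i}-p_{x_i}(m_i,e^t-1)=\sum_{j=m_i}^{x_i}{x_i \choose j}(e^t-1)^j$ and using $Sur(k,r)=k!\,[t^k](e^t-1)^r$ again reduces the distribution function to the finite sum $$\mathbb{P}(T(m)\leq k)=\frac{1}{X_n^k}\sum_{j_1=m_1}^{x_1}\cdots\sum_{j_n=m_n}^{x_n}{x_1 \choose j_1}\cdots{x_n \choose j_n}\,Sur(k,j_1+\ldots+j_n).$$ Here is a sketch in exact rational arithmetic (function names are mine):

```python
from math import comb
from fractions import Fraction
from itertools import product

def sur(k, r):
    """Surjections {1,...,k} -> {1,...,r}, by inclusion-exclusion."""
    return sum((-1)**i * comb(r, i) * (r - i)**k for i in range(r + 1))

def cdf_T(x, m, k):
    """P(T(m) <= k) as an exact rational number."""
    total = 0
    # Sum over all admissible (j_1, ..., j_n) with m_i <= j_i <= x_i.
    for j in product(*(range(mi, xi + 1) for mi, xi in zip(m, x))):
        term = sur(k, sum(j))
        for xi, ji in zip(x, j):
            term *= comb(xi, ji)
        total += term
    return Fraction(total, sum(x)**k)

# Example: x = (3, 2), m = (2, 1).
for k in range(3, 9):
    print(k, cdf_T((3, 2), (2, 1), k))
```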

(II) The expectation of $T(m)$

Finally, using $\mathbb{E}\, T(m)=\sum_{k\geq 0} \mathbb{P}(T(m)>k)$ and writing ${k! \over X_n^k} =X_n\,\int_0^\infty s^k e^{-X_ns}\,ds$ (so that summing over $k$ under the integral turns the coefficient extraction $[t^k]$ into an evaluation of the generating function at $t=s$) leads to

$$\mathbb{E}(T(m))=X_n\int_0^\infty \big(1-\prod_{i=1}^n \left(1-p_{x_i}(m_i,e^s-1)\,e^{-x_i s}\right)\big)\,ds$$
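
This integral is convenient for numerical evaluation. A minimal sketch, assuming SciPy is available; the rewriting $(e^s-1)^k e^{-x_i s}=(1-e^{-s})^k e^{(k-x_i)s}$ keeps the integrand stable for large $s$, since every exponent is negative ($k\leq m_i-1<x_i$):

```python
from math import comb, exp
from scipy.integrate import quad

def expected_T(x, m):
    """E[T(m)] = X_n * int_0^oo (1 - prod_i (1 - p_{x_i}(m_i, e^s-1) e^{-x_i s})) ds."""
    def factor(xi, mi, s):
        # p_{x_i}(m_i, e^s - 1) * e^{-x_i s}, written in overflow-free form.
        return sum(comb(xi, k) * (1.0 - exp(-s))**k * exp((k - xi) * s)
                   for k in range(mi))
    def integrand(s):
        prod = 1.0
        for xi, mi in zip(x, m):
            prod *= 1.0 - factor(xi, mi, s)
        return 1.0 - prod
    value, _err = quad(integrand, 0.0, float('inf'))
    return sum(x) * value

# Example: x = (3, 2), m = (2, 1); the result can be cross-checked
# against the Monte Carlo estimate from the first sketch.
print(expected_T((3, 2), (2, 1)))
```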
