Coupon collector’s problem with unequal probability and calculating the number of expected rare coupons

coupon-collectorexpected valueprobabilityprobability distributions

This is an instance of the "coupon collector's problem" in which not all the "coupons" have the same probability (more exactly, the probability of coupon $i$ follows Zipf distribution $p_i \sim Zipf(i)$.

There are $N$ distinct coupons. We keep collecting coupons till we have $N'$ distinct coupons. How many distinct rare coupons do we expect to collect? A rare coupon is defined as a coupon with probabilities less than $p_{N'}$. In other words, the top common $N'$ distinct coupons are not rare and the rest are rare.

Can someone point me in the right direction?

Best Answer

I’ll do this for separate numbers $m$ of distinct coupons collected and $n$ of distinct coupons regarded as common because I don’t see any simplification allowed by the special case $m=n=N'$.

By linearity of expectation, the expected number of distinct rare coupons collected is the sum over $i\gt n$ of the probabilities $c_i$ of coupon $i$ being collected.

This is a generalization of Last coupon collected in the coupon collectors problem, which amounts to the special case $m=N-1$. We can solve the more general problem with the same Poissonization approach.

So let coupons be drawn in a continuous Poisson process with rate $1$ that is the sum of $N$ independent Poisson processes with rates $p_i$.

The probability density for the process for coupon $i$ to yield its first coupon at time $t$ is $p_i\mathrm e^{-p_it}$. At that time, the probability that $m$ distinct coupons have not yet been collected is

$$ \sum_{S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}\prod_{j\in S}\mathrm e^{-p_jt}\;, $$

where $\ell=N-m$, $P_i$ is the set of sets of coupons other than $i$ with at least $\ell$ coupons, and $(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}$ is minus the Möbius function on $P_i\cup\{\emptyset\}$, ordered by inclusion, which you can check from

$$ \sum_{T\supset S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}=\sum_{\ell\le k\le|T|}(-1)^{k-\ell}\binom{k-1}{\ell-1}\binom{|T|}{k}=1\;. $$

Thus, the probability $c_i$ that when coupon $i$ is first collected $m$ distinct coupons have not yet been collected, and thus coupon $i$ gets collected, is

\begin{eqnarray} c_i &=& \int_0^\infty\mathrm dtp_i\mathrm e^{-p_it}\sum_{S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}\prod_{j\in S}\mathrm e^{-p_jt} \\ &=& \sum_{S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}\frac{p_i}{p_i+\sum_{j\in S}p_j}\;. \end{eqnarray}

For $m=N-1$ and thus $\ell=1$, we recover the probability for coupon $i$ not to be the last coupon collected,

$$ c_i=\sum_{S\subset P_i}(-1)^{|S|-1}\frac{p_i}{p_i+\sum_{j\in S}p_j}\;. $$

For $m=1$ and thus $\ell=N-1$, the sum has a single term, in which $S$ is the set of all $\ell$ coupons other than $i$, and we recover the trivial result $c_i=p_i$.

The expected number of rare coupons collected is

$$ \sum_{i\gt n}c_i=\sum_{i\gt n}\sum_{S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}\frac{p_i}{p_i+\sum_{j\in S}p_j}\;. $$

We can write this as a single sum by collecting the terms with the same denominator:

$$ \sum_{i\gt n}c_i=\sum_T(-1)^{|T|-\ell-1}\binom{|T|-2}{\ell-1}\frac{\sum_{n\lt j\in T}p_j}{\sum_{j\in T}p_j}\;, $$

where the sum now runs over all sets of at least $\ell+1$ coupons.

I don’t see how either of these representations could be simplified for the specific case of a Zipf distribution.