[Math] Coupon Problem generalized, or Birthday problem backward.

birthdaycombinatoricscoupon-collectordiscrete mathematicsprobability

I want to solve a variation on the Coupon Collector Problem, or (alternately) a slight variant on the standard Birthday Problem.
I have a slight variant on the standard birthday problem.

In the standard Coupon Collector problem, someone is choosing coupons at random (with replacement) from n different possible coupons. Duplicate coupons do us no good; we need a complete set. The standard question is "What is the expected number of coupons (or probability distribution in number of coupons) to collect them all?

In the standard birthday problem, we choose k items from n options with replacement (such as k people in a room, each with one of 365 possible birthdays) and try to determine the probability distribution for how many unique values there will be (will they have the same birthday?).

In my problem, someone has chosen k items from n options and I know that there were p distinct values, but I don't know what k was. If p=n this is the coupon problem, but I want to allow for values of p that are less than n. I want to determine the probability distribution for k (actually, all I need is the expected value of k, but the distribution would be interesting as well) as a function of p and n.

Best Answer

To do this properly, you need some sort of underlying distribution for $k$. For example you could use Bayesian methods: start with a prior distribution for $k$, multiply it by the likelihood of seeing $p$ for each $k$ given $n$, and divide by some constant to get a posterior probability distribution for $k$. You might then take the mean, median or mode of that posterior distribution as your central estimate; with an improper uniform prior, taking the mode would give the maximum likelihood estimate.

Another approach based on the coupon collector's calculation might be to assume that the person was aiming to get $p$ distinct items and stopped when this was achieved. In that case the expected value of $k$ would be $$\sum_{j=n-p+1}^n \frac{n}{j} = n(H_n - H_{n-p}) \approx n \log_e \left(\frac{n}{n-p}\right) $$ though the dispersion would be relatively wide. $H_n$ is the $n$th harmonic number.