How many draws to get one of each with replacement

probability

Assuming I have a bag with all $26$ letters in it, how many times would I need to draw out of the bag to have a $90\%$ chance of getting at least one of each letter?

I think drawing $26$ times with replacement would give probability $\frac{26!}{26^{26}}$ of getting exactly one of each letter, but I'm having trouble with a larger number of draws, such as drawing a letter (with replacement) $52$ times and calculating the probability that I got at least one of each letter (not caring which letters are duplicated, or how many times). I keep getting numbers greater than $1$.
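A likely reason for getting values above $1$ is adding the probabilities of overlapping events without correcting for the overlaps. Inclusion-exclusion over which letters are missing fixes this: for $m$ draws,

$$P(\text{all } 26 \text{ letters in } m \text{ draws}) = \sum_{k=0}^{26} (-1)^k \binom{26}{k} \left(\frac{26-k}{26}\right)^m,$$

which always stays in $[0,1]$.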

Background: I bought a bag of chocolates with letters on them, but I didn't get all the letters, so I'm wondering how many I would have to buy to get the complete alphabet. I'm assuming the manufacturer made equal numbers of each letter, and that the total number of chocolates is large enough that we can treat the draws as being with replacement, since our sample will be less than $5\%$ of the population.

Best Answer

This is a special case of the so-called "coupon collector's problem":

https://en.wikipedia.org/wiki/Coupon_collector

Calculating the expected number of draws required is easy using the linearity of expectation, and the variance follows from the independence of the successive waiting times. From there, one can obtain probability bounds using Chebyshev's inequality. I don't know of a way to obtain the pmf directly, although you could try simulation.
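For reference, splitting $T$ (the number of draws needed) into the independent waiting times between successive new letters, each geometric, gives the standard formulas

$$E[T] = \sum_{i=1}^{n}\frac{n}{n-i+1} = n\sum_{k=1}^{n}\frac{1}{k}, \qquad \operatorname{Var}(T) = \sum_{i=1}^{n}\frac{1-p_i}{p_i^2} \le n^2\sum_{k=1}^{\infty}\frac{1}{k^2} = \frac{n^2\pi^2}{6},$$

where $p_i = (n-i+1)/n$ is the chance that a draw yields a new letter once $i-1$ distinct letters have been seen. For $n = 26$ this gives $E[T] \approx 100.2$; these are exactly the quantities the code below uses.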

Update: I played around with this in R. The Chebyshev bound says that 206 draws will be sufficient to have at least a 90% probability of drawing all 26. The exact tail probability from Ross, pointed out in @awkward's answer, gives $P(T > 206) \approx 0.008$ and $P(T > 141) \approx 0.099$ for your example. Here's my R code if you're interested:

# This script calculates upper-tail probabilities for the coupon collector's
# problem in three different ways: 
#     1. A bound based on Markov's inequality
#     2. A bound based on Chebyshev's inequality
#     3. An explicit solution from Ross "A First Course in Probability"

# Compute expected number of trials, E[T], needed to obtain all n coupons
get_mean <- function(n) {
  n * sum(1 / (1:n))
}
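
# For the alphabet example this gives about 100.21 expected draws:
get_mean(26)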

# Compute an upper bound for P(T >= a) using Markov's inequality
get_markov_tail <- function(a, n) {
  stopifnot(a > 0)
  get_mean(n) / a
}

# Compute an upper bound for P(T >= a) using Chebyshev's inequality.
# Uses Var(T) = sum_i (1 - p_i) / p_i^2 <= n^2 * sum_k 1/k^2 < n^2 * pi^2 / 6.
get_chebyshev_tail <- function(a, n) {
  mean_T <- get_mean(n)
  stopifnot(a > mean_T)
  n^2 * pi^2 / (6 * (a - mean_T)^2)
}

# Exact tail probability from Section 4.1 of Ross. Note that this is P(T > a):
# by inclusion-exclusion, the probability that at least one coupon is still
# missing after a draws.
get_exact_tail <- function(a, n) {
  k_seq <- 1:(n - 1)  # the k = n term is identically zero, so it is omitted
  # C(n, k) * ((n - k) / n)^a, computed on the log scale for stability
  term1 <- exp(lchoose(n, k_seq) + a * (log(n - k_seq) - log(n)))
  term2 <- (-1)^(k_seq + 1)  # alternating inclusion-exclusion signs
  sum(term1 * term2)
}
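
# Quick sanity check of the formula at a size where the answer is obvious:
# with n = 2 coupons, P(T > a) = 2 * (1/2)^a (the first a draws are all the
# same coupon), so this should return 0.25.
get_exact_tail(3, 2)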


# Using these functions we'll now try an experiment. Suppose there is a total
# of 26 coupons and we want to find a such that P(T > a) < 0.1. In other words,
# we want to know how many draws we would need for there to be at least a 90%
# chance that we will obtain all 26 coupons. How do the two upper bounds
# compare to each other, and how do they compare to the exact formula?

# The Markov bound is terrible: it doesn't drop below 0.1 until roughly
# 1000 draws.
get_markov_tail(1002, 26)   # about 0.1000
get_markov_tail(1003, 26)   # about 0.0999

# The Chebyshev bound is a massive improvement:
get_chebyshev_tail(1002, 26)   # about 0.0014
get_chebyshev_tail(1003, 26)   # about 0.0014
get_chebyshev_tail(205, 26)    # about 0.101, still above 0.1
get_chebyshev_tail(206, 26)    # about 0.0994, so 206 draws suffice

# The exact tail probability is a substantial improvement over Chebyshev:
get_exact_tail(1002, 26)   # about 2e-16, essentially zero
get_exact_tail(1003, 26)   # essentially zero
get_exact_tail(205, 26)    # about 0.0084
get_exact_tail(206, 26)    # about 0.0080
get_exact_tail(140, 26)    # about 0.103, still above 0.1
get_exact_tail(141, 26)    # about 0.099, so 141 draws give a 90% chance
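
Finally, since the answer mentions simulation, here's a quick Monte Carlo cross-check of the exact formula (the replication count and seed below are arbitrary choices, not part of the original analysis):

# Simulate the collection process directly and estimate the tail empirically
simulate_T <- function(n) {
  seen <- rep(FALSE, n)
  draws <- 0
  while (!all(seen)) {
    draws <- draws + 1
    seen[sample.int(n, 1)] <- TRUE
  }
  draws
}

set.seed(1)        # arbitrary seed, for reproducibility
sims <- replicate(1e4, simulate_T(26))
mean(sims)         # should be close to get_mean(26), about 100.2
mean(sims > 141)   # should be close to get_exact_tail(141, 26), about 0.099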