The expected number of ingredients used to make $4$ pizzas if each pizza must contain $4$ of $100$ possible ingredients

probabilitystatistics

I'm trying to solve an optimization problem of the Google Hashcode contest and by analyzing the dataset I reduced a part of it down to a statistics problem, which I'm not able to solve. Any help would be appreciated.

Assume we have $100$ types of ingredients (from each type infinitely many) and we want to make $4$ pizzas. Each pizza must contain exactly $4$ distinct ingredients. All of the $100$ ingredients are equally likely for a candidate position of an ingredient for a pizza. What is the expected value of the total distinct ingredients used in all $4$ pizzas?

Here is my initial thought which is unlikely to be true:

$$E[N] = \sum_{i = 4}^{16}{i \times \frac{\binom{100}{i} {\binom{i}{4}}^4 – \binom{100}{i – 1} {\binom{i – 1}{4}}^4}{\binom{100}{16}{\binom{16}{4}}^4}} =15.9. $$
Numerator is essentially difference of total number of states with $i$ ingredients at most and with $i-1$ ingredients at most which gives the total number of states with exactly $i$ ingredients, while denominator is the total number of states with $16$ ingredients.

I think the answer for the above method ($15.9$) is not very intuitive. It seems more logical for the answer to be much more lower.

ps. I previously tried to put ${\binom{100}{4}}^4$ in the denominator but after doing a simple summation I ended up with a very unreasonable quantity.

ps2. Order of pizzas are not important.

Best Answer

Consider a particular ingredient:

  • The probability it is in the first pizza is $\frac4{100}=0.04$
  • The probability it is not in the first pizza is $1-0.04=0.96$
  • The probability it is not in any of the four pizzas is $0.96^4=0.84934656$
  • The probability it is in any of the four pizzas is $1-0.84934656=0.15065344$

But there are $100$ possibilities so the expected number of distinct ingredients in the four pizzas is $100 \times 0.15065344$ which is $$15.065344$$

To confirm the answer, here is a simulation in R:

set.seed(2021)
distinctingredients <- function(numpizzas, totingr, ingrperpizza){
length(unique(as.vector(replicate(numpizzas, sample(totingr, ingrperpizza)))))
}
mean(replicate(10^5, distinctingredients(4,100,4)))
# 15.06479

which seems close to my answer.

Related Question