Deterministic formula for average number of unique items picked

combinatoricsdeterministicdistributionsprobability

I am curious how to formulate a deterministic answer to a problem I have in mind. I have computed it stochastically, but am unsure of how to frame and compute the problem deterministically. My prob & stats classes are a few decades behind me…

Let's say that we have 100 people, and each person is asked to choose a number between 1 and 100 (inclusive). Maybe everyone chooses the same number, but this is unlikely. It is also unlikely that everyone chooses a different unique number. How many of the numbers from 1 to 100 will get picked?

Stochastically, it seems that somewhere between 59 and 67 of the numbers will be picked (10th percentile and 90th percentile respectively) with a median example resulting in 63 of the numbers being chosen.

Is there a deterministic way calculate this? What concepts are relevant? How does the formula vary for a different number of people? And for a different number of numbers to choose?

Update: I'll include the Python script I ran to clarify what I meant by "Stochastically, it seems…":

import random
import statistics

trials = 10000
n = 100
m = 100

lens = []
for j in range(trials):
    k = []
    for x in range(n):
        k.append(random.randint(1,m))
    lens.append(len(list(set(k))))

print(statistics.median(lens), statistics.quantiles(lens, n=10))

Best Answer

For greater generality, let's say that there are $n$ people and $m$ values to choose from. If you are given the chosen values $U_1,...,U_n$ then the deterministic formula for the number of values that were chosen is:

$$K \equiv \sum_{k=1}^m \mathbb{I}(Y_k > 0) \quad \quad \quad \quad \quad Y_k \equiv \sum_{i=1}^n \mathbb{I}(X_i = k).$$

The value $K$ is called the occupancy statistic and it is the number of unique values that were chosen out of the $m$ values (so it has range $1 \leqslant K \leqslant \min(n,m)$). This value is a deterministic function of the underlying values that were chosen. In a statistical context, if we assume that the values are chosen independently and are uniform over the possible choices then we have $U_1,...,U_n \sim \text{U} \{ 1,...,m \}$ and the value $K$ is then a random variable that follows the classical occupancy distribution (see e.g., O'Neill 2020 for details). This distribution has probability mass function:

$$\mathbb{P}(K=k) = \frac{(m)_k \cdot S(n,k)}{m^n} \quad \quad \quad \quad \quad \text{for } 1 \leqslant k \leqslant \min(n,m),$$

where $(m)_k \equiv \prod_{i=0}^{k-1} (m-i)$ are the falling factorials and $S(n,k)$ are the Stirling numbers of the second kind.

You can find a great deal of information about this distribution in the linked paper, including the main moments of this distribution, asymptotic results when the parameters are large, and approximation and computation results. You may also be interested to note that the various probability functions for this distribution (mass function, cumulative distribution function, quantile function, random generation function) are available in R using the functions docc, pocc, qocc and rocc in the occupancy package. These functions can quickly compute the distribution for any computationally feasible value of $n$ and $m$.

Related Question