Solved – Probability of intersection from multiple sampling of the same population

hypergeometric-distributionprobability

Here is an example case:

  • I have a population of 10,000 items. Each item has an unique id.
  • I randomly pick 100 items and record down the ids
  • I put the 100 items back into the population
  • I randomly pick 100 items again, record down the ids and replace.
  • In total, I repeat this random sampling 5 times

What is the probability that $X$ number of items appear in all 5 random samplings?

I am not very well versed in statistics. Would this be correct for $X = 10$?

  • For each sampling, the number of possible combinations of 100 items from 10,000 is ${\rm binom}(10000, 100)$
  • Out of all possible combinations of 100 items, ${\rm binom}(9990, 90) * {\rm binom}(100, 10)$ combinations contain 10 specific items
  • The probability of having 10 specific items is $({\rm binom}(9990, 90) * {\rm binom}(100, 10)) / {\rm binom}(10000, 100)$
  • The calculated probability to the power of 5 would represent 5 indepenent samplings.

So essentially we are just calculating 5 independent hypergeometric probabilities and then multiplying them together? I feel like I am missing a step somewhere.

Best Answer

Compute the chances recursively.

Let $p_s(x)$ be the probability that exactly $x$ values, $0 \le x \le k$, are selected in all $s\ge 1$ independent draws of $k$ items (without replacement) from a population of $n \ge k \gt 0$ members. (Let's hold $n$ and $k$ fixed for the duration of the analysis so they don't have to be mentioned explicitly.)

Let $p_s(x\mid y)$ be the probability that if exactly $y$ values are selected in the first $s-1$ draws, then $x \le y$ of them are selected in the last draw. Then because there are $\binom{y}{x}$ subsets of $x$ elements of those $y$ elements, and $\binom{n-y}{k-x}$ subsets of the remaining $k-x$ elements are separately selected out of the other $n-y$ members of the population,

$$p_s(x\mid y) = \frac{\binom{y}{x}\binom{n-y}{k-x}}{ \binom{n}{k}}.$$

The law of total probability asserts

$$p_s(x) = \sum_{y=x}^k p_s(x\mid y) p_{s-1}(y).$$

For $s=1$, it's a certainty that $x=k$: this is the starting distribution.

The total computation needed to obtain the full distribution up through $s$ repetitions is $O(k^2 s)$. Not only is that reasonably quick, the algorithm is easy. One pitfall awaiting the unwary programmer is that these probabilities can become extremely small and underflow floating-point calculations. The following R implementation avoids this by computing the values of $\log(p_s(x))$ in columns $1, 2, \ldots, s$ of an array.

lp <- function(s, n, k) {
  P <- matrix(NA, nrow=k+1, ncol=s, dimnames=list(0:k, 1:s))
  P[, 1] <- c(rep(-Inf, k), 0)
  for (u in 2:s) 
    for (i in 0:k) {
      q <- P[i:k+1, u-1] + lchoose(i:k, i) + lchoose(n-(i:k), k-i) - lchoose(n, k)
      q.0 <- max(q, na.rm=TRUE)
      P[i+1, u] <- q.0 + log(sum(exp(q - q.0)))
    }
  return(P)
}
p <- function(...) zapsmall(exp(lp(...)))

The answer to the question is obtained by letting $s=5,$ $n=10000=10^4$, and $k=100=10^2$. The output is a $101\times 5$ array, but most of the numbers are so small we may focus on very small $x$. Here are the first four rows corresponding to $x=0,1,2,3$:

p(5, 1e4, 1e2)[1:4, ]

The output is

  1         2         3      4        5
0 0 0.3641945 0.9900484 0.9999 0.999999
1 0 0.3715891 0.0099034 0.0001 0.000001
2 0 0.1857756 0.0000481 0.0000 0.000000
3 0 0.0606681 0.0000002 0.0000 0.000000

Values of $x$ label the rows while values of $s$ label the columns. Column 5 shows the chance that one element appears in all five samples is minuscule (about one in a million) and there's essentially no chance that two or more elements appear in all five samples.

If you would like to see just how small these chances are, look at their logarithms. Base 10 is convenient and we don't need many digits:

u <- lp(5, 1e4, 1e2)[, 5]
signif(-u[-1] / log(10), 3)

The output tells us how many zeros there are after the decimal point:

    1     2     3     4     5     6     7     8     9    10  ...   97    98    99   100 
  6.0  12.3  18.8  25.5  32.3  39.2  46.2  53.2  60.4  67.6 ... 917.0 933.0 949.0 967.0 

Numbers in the top row are values of $x$. For instance, the chance of exactly three values showing up in all five samples is found by computing exp(u[4]), giving $0.000\,000\,000\,000\,000\,000\,1434419\ldots$ and indeed this has $18$ zeros before the first significant digit. As a check, the last value $967.0$ is a rounded version of $967.26$. $\binom{10000}{100}^{-4}$ (which counts the chances that the first sample reappears in the next four samples) equals $10^{-967.26}.$

Related Question