Probability – Detailed Binomial Approximation to Hypergeometric Probability

Tags: binomial-distribution, mathematical-statistics, probability

I am trying to understand how to apply the binomial distribution to a simple probability problem. I can solve the problem directly via the classical definition of probability but, when trying to interpret the problem as sampling from a binomial distribution, I get different results.

Problem statement:

The prevalence of some disease in a given country is $p$. A sample of $n<N$ people is selected from a city with $N$ inhabitants (in that same country). What is the probability that exactly $k$ people in this sample have the disease?

Method 1 (favourable/total): There are $\binom N n$ possible samples from the population. Of those, we are interested in the ones with $k$ infected people (out of $pN$) and $n-k$ healthy people (out of $(1-p)N$), which account for $\binom {pN} k\binom {(1-p)N} {n-k}$ possibilities. Thus, the probability is
$$
P = \frac {\binom {pN} k\binom {(1-p)N} {n-k}} {\binom N n}
$$

Method 2 (binomial): It seems that this problem can be cast as sampling from a binomial distribution with success probability $p$ and $n$ trials. We are interested in $k$ successes, so we should have
$$
P(k) = \binom {n} k p^k (1-p)^{n-k}
$$

If we take concrete numbers, e.g. $N=200$, $p=0.1$, $n=20$, $k=2$, we end up with $P\approx0.30$ for method 1, while method 2 gives $P \approx 0.29$.

  • Why are these numbers different?
  • What is wrong with the binomial solution?
  • Should it somehow depend on the population size $N$?

Best Answer

The exact probability is hypergeometric, as in the first displayed equation in your Question. It assumes sampling without replacement. (That is, the same person cannot be chosen twice.)

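If $n$ is very much smaller than $N,$ then a binomial model, which assumes sampling with replacement, may be a useful approximation. (The approximation rests on the relatively low chance that the same person would be chosen more than once when only a few people $n$ are chosen out of many $N.$ A common rule of thumb for the binomial approximation to be useful is $n/N < 0.1.$) In your example $n/N = 20/200 = 0.1,$ right at the edge of this rule, so a visible difference between the two answers is not surprising. The binomial formula itself does not involve $N$; what depends on $N$ is how well it approximates the exact hypergeometric answer.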
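To see where the approximation comes from, the hypergeometric probability can be rearranged as follows. This is a sketch of the standard argument, not something taken from the Question; the symbol $K = pN$ (the number of infected people in the city, assumed to be a whole number) is introduced here for brevity.
$$
\frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}
= \binom{n}{k}\,
\prod_{i=0}^{k-1}\frac{K-i}{N-i}\;
\prod_{j=0}^{n-k-1}\frac{N-K-j}{N-k-j}
$$
With $n,$ $k$ and $p$ held fixed and $N \to \infty,$ each factor $\frac{K-i}{N-i}$ tends to $p$ and each factor $\frac{N-K-j}{N-k-j}$ tends to $1-p,$ so the right-hand side tends to
$$
\binom{n}{k}\, p^k (1-p)^{n-k}.
$$
For finite $N$ the factors drift away from $p$ and $1-p$ as people are removed from the pool, and the drift is larger when $n$ is not small relative to $N.$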

Let's look at specific numbers to see how this plays out computationally. Let $N = 100{,}000,\ n = 500,\ k = 10,\ p = 0.02.$

Hypergeometric: Let $X$ be the number of infected people in the sample. The number of infected individuals in the city is $pN = 2000,$ and the remaining 98,000 are uninfected. Then $P(X = 10) = 0.1267$ and $P(X \le 10) = 0.5831.$ The computations below are in R, where dhyper and phyper are the PMF and CDF of a hypergeometric distribution.

> dhyper(10, 2000, 98000, 500)
[1] 0.1266969
> phyper(10, 2000, 98000, 500)
[1] 0.5830506

Binomial approximation: Here $Y \sim \mathsf{Binom}(n = 500, p = .02).$ Then $P(Y = 10) = 0.1264$ and $P(Y \le 10) = 0.5830.$

> dbinom(10, 500, .02)
[1] 0.1263798
> pbinom(10, 500, .02)
[1] 0.583044

In these examples the binomial approximations are very good. The plot below shows this hypergeometric distribution (blue bars) and its binomial approximation (red). Within the resolution of the plot, it is difficult to distinguish between the two.

[Figure: hypergeometric distribution (blue bars) with its binomial approximation (red)]
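For contrast, here is a minimal R sketch that applies the same two functions to the numbers from the Question, where $n/N = 20/200 = 0.1$ sits right at the rule-of-thumb boundary. The variable names (K, exact, approx, N_big) are my own, and the sketch assumes $pN$ is a whole number. The loop then holds $n,$ $p$ and $k$ fixed while the population grows, to show how the population size enters.

```r
# Numbers from the Question: N = 200, p = 0.1, n = 20, k = 2
N <- 200; p <- 0.1; n <- 20; k <- 2
K <- round(p * N)                  # infected people in the city (assumes p * N is whole)

exact  <- dhyper(k, K, N - K, n)   # sampling without replacement (method 1)
approx <- dbinom(k, n, p)          # sampling with replacement (method 2)
print(round(c(hypergeometric = exact, binomial = approx, n_over_N = n / N), 3))

# Keep n, p, k fixed and let the population grow
for (N_big in c(200, 2000, 20000, 200000)) {
  K_big <- round(p * N_big)
  cat(sprintf("N = %6d  n/N = %.4f  exact = %.4f  binomial = %.4f\n",
              as.integer(N_big), n / N_big,
              dhyper(k, K_big, N_big - K_big, n), dbinom(k, n, p)))
}
```

The first line of output should reproduce the two values from the Question (about 0.30 and about 0.285). In the loop the binomial column never changes, because the binomial formula has no $N$ in it, while the exact column moves toward it as $n/N$ shrinks.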

Note: With huge population sizes, the binomial coefficients in the hypergeometric PMF can become too large for R to represent directly. R's hypergeometric functions are written to minimize this difficulty, but even so, there are limits on what can be computed. R can return log probabilities instead (for example, through the log = TRUE argument of dhyper), which avoids overflow; you can then exponentiate to get the probabilities.
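A minimal sketch of the log-scale route, using the same numbers as above. The variable names are mine. The by_hand line is only an illustration of the formula from the Question evaluated with log binomial coefficients (lchoose); it is not a claim about how dhyper works internally.

```r
# The same hypergeometric probability, computed directly and on the log scale
direct  <- dhyper(10, 2000, 98000, 500)
via_log <- exp(dhyper(10, 2000, 98000, 500, log = TRUE))

# Formula from the Question evaluated with log binomial coefficients.
# The raw coefficient choose(100000, 500) is too large for a double.
by_hand <- exp(lchoose(2000, 10) + lchoose(98000, 490) - lchoose(100000, 500))

print(c(direct = direct, via_log = via_log, by_hand = by_hand))
print(is.finite(choose(100000, 500)))   # the raw coefficient overflows
```

All three values should agree with the 0.1267 reported earlier. For cumulative probabilities, phyper has a corresponding log.p argument.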
