Solved – PMF for sum of hypergeometric distributions

distributions, hypergeometric-distribution, mixture-distribution

Basically, my question is the same as this one, except I need more than the $k = 0$ special case:

Given a sum of independent random variables, each following a hypergeometric distribution, is there any efficient way to compute the PMF of that sum?

At the very least, are there any tricks that might make a numerical evaluation less painful than a straightforward convolution (for cases where the number of variables and/or the population size is very large)?

Best Answer

  1. Sampling in the ballpark of half the population doesn't usually lead to problems with the normal approximation. Here's an example for the distribution of the number of white balls drawn from a population of 300 white balls and 700 black balls, sampling 500 balls without replacement, along with a normal distribution with the same mean and variance as the hypergeometric.

      # PMF of the number of white balls drawn: 300 white, 700 black, 500 drawn
      x <- dhyper(120:180, 300, 700, 500)
      plot(120:180, x, type = "h")
    

    [Figure: hypergeometric PMF as vertical bars with the moment-matched normal density overlaid]

    This suggests that as long as the numbers of each kind of ball are neither too large nor too small and the total population size is reasonably large, just using normal approximations (possibly with continuity correction, depending on circumstances) may be quite feasible.
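
    For example, here is a minimal sketch (same 300 white / 700 black population, 500 drawn) comparing exact hypergeometric probabilities with the continuity-corrected normal; the moments used are the standard hypergeometric mean and finite-population-corrected variance:

      # moment-matched normal for the hypergeometric above
      N <- 1000; K <- 300; n <- 500
      mu  <- n * K / N
      sig <- sqrt(n * (K/N) * (1 - K/N) * (N - n) / (N - 1))
      x <- 130:170
      # worst-case CDF error over the central range; small when the approximation is good
      max(abs(phyper(x, K, N - K, n) - pnorm(x + 0.5, mu, sig)))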

  2. In cases where the normal approximation isn't reasonable for the individual hypergeometric components, a normal approximation to the sum may still be adequate if there are enough terms in the sum, by the usual central-limit reasoning.
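
    As a sketch of that (K, N and n below are hypothetical per-component parameter vectors: white balls, population sizes, and draws), the sum's mean and variance are just the sums of the component means and variances, by independence:

      # normal approximation to S = X_1 + ... + X_m
      K <- c(300, 50, 120)      # white balls per component (hypothetical)
      N <- c(1000, 400, 900)    # population sizes (hypothetical)
      n <- c(500, 100, 300)     # draws per component (hypothetical)
      mu  <- sum(n * K / N)
      sig <- sqrt(sum(n * (K/N) * (1 - K/N) * (N - n) / (N - 1)))
      pnorm(210 + 0.5, mu, sig) # approximate P(S <= 210), with continuity correction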

  3. If the proportion of white balls (/"successes") is very large or very small compared to the population, a binomial or even a Poisson approximation (applied to the black balls if the proportion of white is large) may be adequate, which suggests that either a Poisson-binomial or a Poisson approximation to the sum may be reasonable in some situations. In other cases, a moment-matched (possibly shifted) binomial may be adequate.
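
    A minimal sketch of the moment-matched binomial, with mu and v standing in for the sum's mean and variance (computed as in point 2); the optional shift, which would match a third moment, is omitted:

      # match Bin(nb, p): nb * p = mu and nb * p * (1 - p) = v
      mu <- 202.5; v <- 90   # placeholder values for the sum's mean and variance
      p  <- 1 - v / mu       # valid since v < mu: hypergeometric sums are underdispersed
      nb <- round(mu / p)
      dbinom(200, nb, p)     # approximate P(S = 200)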

  4. If none of those are adequate, you may have to fall back on convolution. Assuming the terms all have different parameters (so no shortcuts apply), $f = f_1 * f_2 * \cdots * f_n$ can be approached in a number of ways. Let $\hat{f}_i = \mathcal{F}(f_i)$ be the Fourier transform of the $i\,$th term; then the Fourier transform of the convolution is $\hat{f} = \prod_i \hat{f}_i$. These products might be generated one at a time (that is, looping through and multiplying each transform into a running product), or, especially if you have multiple cores that can run in parallel, pairwise: combine pairs, then combine the results, and so on in turn, before inverting back to the resulting PMF, $f = \mathcal{F}^{-1}(\hat{f})$.

    I don't think the Fourier-transform route should be slow unless there are so many components that suggestion 2 would probably work anyway.
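
    A minimal sketch of the looping version in R, with hypothetical component parameters. R's fft is unnormalized, so the inverse transform is divided by the length, and every component PMF is zero-padded to the full support length of the sum so that the circular convolution computed by the FFT coincides with the linear convolution we want:

      # convolve hypergeometric PMFs via the product of their Fourier transforms
      K <- c(300, 50, 120); N <- c(1000, 400, 900); n <- c(500, 100, 300)
      L <- sum(pmin(n, K)) + 1   # the sum's support is 0, 1, ..., sum(min(n_i, K_i))
      fhat <- rep(1 + 0i, L)
      for (i in seq_along(K)) {
        f_i  <- dhyper(0:(L - 1), K[i], N[i] - K[i], n[i])  # zero-padded component PMF
        fhat <- fhat * fft(f_i)                             # running product of transforms
      }
      pmf <- Re(fft(fhat, inverse = TRUE)) / L   # PMF of the sum on 0:(L-1)
      sum(pmf)                                   # sanity check: should be ~1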