[Math] Definition and statistics of the Negative-Hypergeometric distribution

probability distributions

The Encyclopedia of Mathematics defines the Negative Hypergeometric distribution (NHG) in the following way:

There are $N$ elements, of which $M$ are marked and the rest are unmarked. Elements are drawn at random without replacement, until the sample contains a constant number $m$ of marked elements. Then, the number of unmarked elements in the sample has a NHG distribution.

They give its PMF, mean, variance, and a relation to the hypergeometric distribution (HG – the number of marked elements when the total sample size is a constant number $n$), but I didn't understand any of them. I tried to solve it myself and got quite different expressions:

  1. Regarding the relation between HG and NHG: if the number of marked elements is $m$, then the number of unmarked elements is $k$ iff the sample size is $k+m$. Hence:
    $$NHG[N,M,m](k) = HG[N,M,k+m](m)$$

  2. Regarding the PMF of the NHG: we have to choose $m$ out of $M$ marked elements, and $k$ out of $N-M$ unmarked elements, and divide by the total number of ways to choose $m+k$ out of $N$. So the PMF should be:

$$NHG[N,M,m](k) = \frac{{M\choose m} {N-M\choose k}}{{N\choose{m+k}}} $$

  3. Regarding the mean of the NHG: the mean of $HG[N,M,n]$ is $n\frac{M}{N}$, so the mean of $HG[N,M,k+m]$ (the number of successes) is $(k+m)\frac{M}{N}$. But the number of successes is also $k$. So we have the following equation:

$$E[k] = (E[k]+m)\frac{M}{N}$$

Hence:

$$E[k] = \frac{m\frac{M}{N}}{1-\frac{M}{N}} = \frac{m M}{N-M}$$

Are my calculations correct? If not, what is the explanation of the formulas in the encyclopedia?

Best Answer

Let the random variable $Y$ have a negative hypergeometric distribution with parameters as in the OP. We find $\Pr(Y=k)$. The event $Y=k$ happens if we have exactly $m-1$ marked elements in the first $m+k-1$ trials, and then a marked element on the $(m+k)$-th trial.

Using reasoning like yours, we find that $$\Pr(Y=k)=\frac{\binom{M}{m-1}\binom{N-M}{k}}{\binom{N}{m+k-1}}\cdot\frac{M-(m-1)}{N-(m+k-1)}.$$ This can be manipulated into various equivalent forms.
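As a sketch of one such manipulation (a check on the algebra, not necessarily the form the encyclopedia uses), note that

$$\binom{M}{m-1}\bigl(M-(m-1)\bigr)=m\binom{M}{m},\qquad \binom{N}{m+k-1}\bigl(N-(m+k-1)\bigr)=(m+k)\binom{N}{m+k},$$

so the expression above becomes

$$\Pr(Y=k)=\frac{m}{m+k}\cdot\frac{\binom{M}{m}\binom{N-M}{k}}{\binom{N}{m+k}}.$$

In words: this is the hypergeometric expression from item 2 of the question, multiplied by $\frac{m}{m+k}$, the probability that the last of the $m+k$ draws is a marked one given that exactly $m$ of them are marked.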

Remark: Note that your analysis in 1.) is not right, for the analysis does not take into account the fact that we stop as soon as we get $m$ marked.
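A quick numerical way to see the effect of the stopping rule is to sum both candidate expressions over all $k$. The sketch below uses exact rational arithmetic; the function names and the parameter values $N=10$, $M=4$, $m=2$ are chosen here purely for illustration. A valid PMF must sum to 1, and the fixed-sample-size expression from item 2 of the question does not.

```python
from fractions import Fraction
from math import comb


def pmf_stopping(N, M, m, k):
    """Pr(Y = k): m-1 marked among the first m+k-1 draws, then a marked draw."""
    before = Fraction(comb(M, m - 1) * comb(N - M, k), comb(N, m + k - 1))
    last = Fraction(M - (m - 1), N - (m + k - 1))
    return before * last


def pmf_fixed_sample(N, M, m, k):
    """Expression from item 2 of the question: HG with sample size m+k."""
    return Fraction(comb(M, m) * comb(N - M, k), comb(N, m + k))


# Hypothetical parameters, picked only to make the check concrete.
N, M, m = 10, 4, 2
ks = range(N - M + 1)  # k can be anything from 0 to N-M unmarked elements

total_stopping = sum(pmf_stopping(N, M, m, k) for k in ks)
total_fixed = sum(pmf_fixed_sample(N, M, m, k) for k in ks)

print("sum with stopping rule:    ", total_stopping)
print("sum with fixed sample size:", total_fixed)
```

The second sum exceeds 1 because the events "exactly $m$ marked among the first $m+k$ draws" for different $k$ are not mutually exclusive, whereas the events $Y=k$ are.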

One can find an expression for the mean by using an indicator random variable argument. This has been done in the past on MSE, at least once by me, but of course I cannot find it. The calculation was for a close relative of your negative hypergeometric, in which we count the number of trials until the first marked element.
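A sketch of how such an indicator argument can go for the count of unmarked elements (the indicators $I_j$ are notation introduced here, not taken from the earlier MSE post): imagine that all $N$ elements are drawn in a random order, and for each unmarked element $j$ let $I_j=1$ if $j$ is drawn before the $m$-th marked element, and $I_j=0$ otherwise. Then

$$Y=\sum_{j=1}^{N-M} I_j .$$

Look only at the relative order of element $j$ and the $M$ marked elements. Element $j$ is equally likely to occupy any of the $M+1$ positions in that order, and it comes before the $m$-th marked element exactly when it occupies one of the first $m$ positions. Hence

$$\Pr(I_j=1)=\frac{m}{M+1},\qquad E[Y]=\sum_{j=1}^{N-M}\Pr(I_j=1)=\frac{m\,(N-M)}{M+1},$$

which differs from the expression $\frac{mM}{N-M}$ obtained in the question.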
