Confidence Interval – What is the Finite-Population Correction for the Wilson Score Interval for a Binary Proportion?

binomial distribution, confidence interval, finite-population

The Wilson score interval is an approximate confidence interval for an unknown binary proportion for IID binary data. The interval is based on the central limit theorem like the Wald interval, but it uses a transformed pivotal quantity that ensures that the interval is concentrated on the true support of the proportion parameter. Given data $X_1,X_2,\ldots,X_n \sim \text{IID Bern} (\theta)$ the interval at $1-\alpha$ confidence level is given by:

$$\text{CI}_\infty(1-\alpha) = \Bigg[ \frac{n \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n + \chi_{1,\alpha}^2} \pm \frac{\chi_{1,\alpha}}{n + \chi_{1,\alpha}^2} \cdot \sqrt{n \hat{\theta}_n (1-\hat{\theta}_n) + \tfrac{1}{4} \chi_{1,\alpha}^2} \Bigg],$$

where $\hat{\theta}_n$ is the observed sample proportion, $\chi_{1,\alpha}^2$ is the critical point of the chi-squared distribution with one degree of freedom (using upper tail area $\alpha$), and $\chi_{1,\alpha}$ is its positive square root. The Wilson score interval is known to have good coverage properties and to perform well against other confidence intervals for a binary proportion (see Brown, Cai and DasGupta (2001) for further discussion). In particular, it generally outperforms the standard Wald interval.
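The formula above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from any of the cited sources: the function name `wilson_interval` and its arguments are hypothetical, and the critical point is obtained from the fact that a chi-squared variable with one degree of freedom is a squared standard normal, so that $\chi_{1,\alpha}$ equals the $1-\alpha/2$ quantile of the standard normal distribution.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, alpha=0.05):
    """Standard (infinite-population) Wilson score interval (illustrative sketch)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    # chi_{1,alpha} is the (1 - alpha/2) standard normal quantile,
    # and chi^2_{1,alpha} is its square.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    chi2 = z * z
    p_hat = successes / n  # observed sample proportion
    centre = (n * p_hat + chi2 / 2) / (n + chi2)
    half_width = z / (n + chi2) * sqrt(n * p_hat * (1 - p_hat) + chi2 / 4)
    return centre - half_width, centre + half_width


# Hypothetical data: 45 positive outcomes in a sample of 100.
lower, upper = wilson_interval(45, 100)
print(f"95% Wilson interval: [{lower:.4f}, {upper:.4f}]")
```

Note that the interval is centred on a value shrunk from $\hat{\theta}_n$ towards one-half, and that it stays inside $[0,1]$ even when the sample contains no positive outcomes.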

The above interval form is for the parameter $\theta$ which (from the law of large numbers) is the long-run proportion of positive outcomes in an infinite superpopulation (see this related answer for technical details). However, in many cases of interest we would instead want to form a confidence interval for the proportion in a finite population of size $N$, where $n \leqslant N < \infty$.


Question: What is the appropriate adjustment for this confidence interval when the sample comes from a finite population of size $N$ and the parameter of interest in the inference is the (finite) population proportion $\theta_N \equiv \sum_{i=1}^N X_i /N$? (Note that this latter parameter must have a value in the range $0,\tfrac{1}{N},\ldots,\tfrac{N-1}{N},1$.)

Best Answer

This answer is based on information in the paper O'Neill (2021), which sets out mathematical properties of the Wilson score interval, including the finite population correction for inference to a finite population or the unsampled part of a finite population. Please see the linked paper for more details on the matter discussed here.

CI for the proportion in a finite population: You can find a derivation of the standard Wilson score interval in this related answer. The present analysis, adjusting for a finite population, is relatively simple to do using sample moment results found in O'Neill (2014). To facilitate this analysis, define the effective sample size for the inference as:

$$n_* \equiv n \cdot \frac{N-1}{N-n}.$$
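As a purely illustrative worked example (the numbers are hypothetical and not taken from the paper), a sample of $n = 100$ values from a population of $N = 1000$ gives:

$$n_* = 100 \cdot \frac{1000-1}{1000-100} = 100 \cdot \frac{999}{900} = 111,$$

so the inference about the population proportion treats the sample as if it contained 111 IID observations rather than 100.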

As in the linked paper, we will derive the confidence interval via standard manipulations but starting from a pivotal quantity for the finite-population inference. Our pivotal quantity is:

$$n_* \cdot \frac{(\hat{\theta}_n-\theta_N)^2}{\theta_N (1-\theta_N)} \overset{\text{Approx}}{\sim} \text{ChiSq}(1).$$

As in the question, let $\chi_{1,\alpha}^2$ denote the critical point of the chi-squared distribution with one degree-of-freedom (with upper tail area $\alpha$). For any confidence level $1-\alpha$ we then have the probability interval:

$$\begin{align} 1-\alpha &\approx \mathbb{P} \Big( n_* (\hat{\theta}_n-\theta_N)^2 \leqslant \chi_{1,\alpha}^2 \theta_N (1-\theta_N) \Big) \\[6pt] &= \mathbb{P} \Big( n_* (\hat{\theta}_n^2 - 2 \hat{\theta}_n \theta_N + \theta_N^2) \leqslant \chi_{1,\alpha}^2 (\theta_N-\theta_N^2) \Big) \\[6pt] &= \mathbb{P} \Big( (n_* + \chi_{1,\alpha}^2) \theta_N^2 - (2 n_* \hat{\theta}_n + \chi_{1,\alpha}^2) \theta_N + n_* \hat{\theta}_n^2 \leqslant 0 \Big) \\[6pt] &= \mathbb{P} \Bigg( \theta_N^2 - 2 \cdot\frac{n_* \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_* + \chi_{1,\alpha}^2} \cdot \theta_N + \frac{n_* \hat{\theta}_n^2}{n_* + \chi_{1,\alpha}^2} \leqslant 0 \Bigg) \\[6pt] &= \mathbb{P} \Bigg( \bigg( \theta_N - \frac{n_* \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_* + \chi_{1,\alpha}^2} \bigg)^2 \leqslant \frac{\chi_{1,\alpha}^2 (n_* \hat{\theta}_n (1-\hat{\theta}_n) + \tfrac{1}{4} \chi_{1,\alpha}^2)}{(n_* + \chi_{1,\alpha}^2)^2} \Bigg) \\[6pt] &= \mathbb{P} \Bigg( \theta_N \in \Bigg[ \frac{n_* \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_* + \chi_{1,\alpha}^2} \pm \frac{\chi_{1,\alpha}}{n_* + \chi_{1,\alpha}^2} \cdot \sqrt{n_* \hat{\theta}_n (1-\hat{\theta}_n) + \tfrac{1}{4} \chi_{1,\alpha}^2} \Bigg] \Bigg), \\[6pt] \end{align}$$

and substitution of the observed sample proportion (for simplicity I will use the same notation for this value) then leads to the finite-population Wilson score interval:

$$\text{CI}_N(1-\alpha) = \Bigg[ \frac{n_* \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_* + \chi_{1,\alpha}^2} \pm \frac{\chi_{1,\alpha}}{n_* + \chi_{1,\alpha}^2} \cdot \sqrt{n_* \hat{\theta}_n (1-\hat{\theta}_n) + \tfrac{1}{4} \chi_{1,\alpha}^2} \Bigg].$$
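The algebra in this derivation can be checked numerically. The sketch below uses hypothetical values for $n$, $N$, $\hat{\theta}_n$ and $\alpha$ (none of them come from the paper) and confirms that both endpoints of the interval solve the boundary equation $n_* (\hat{\theta}_n-\theta)^2 = \chi_{1,\alpha}^2 \theta (1-\theta)$, which is the first line of the derivation with equality in place of the inequality.

```python
from math import isclose, sqrt
from statistics import NormalDist

# Hypothetical illustrative values, not taken from the paper.
n, N, p_hat, alpha = 100, 1000, 0.45, 0.05

n_star = n * (N - 1) / (N - n)           # effective sample size n_*
z = NormalDist().inv_cdf(1 - alpha / 2)  # chi_{1,alpha}
chi2 = z * z                             # chi^2_{1,alpha}

centre = (n_star * p_hat + chi2 / 2) / (n_star + chi2)
half_width = z / (n_star + chi2) * sqrt(n_star * p_hat * (1 - p_hat) + chi2 / 4)


def pivot(theta):
    """Pivotal quantity evaluated at a candidate population proportion."""
    return n_star * (p_hat - theta) ** 2 / (theta * (1 - theta))


# Each endpoint should sit exactly on the boundary of the probability statement.
for endpoint in (centre - half_width, centre + half_width):
    assert isclose(pivot(endpoint), chi2, rel_tol=1e-9)

# A point inside the interval should satisfy the inequality strictly.
assert pivot(centre) < chi2
print("both endpoints solve the boundary equation")
```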

As can be seen, the finite-population correction in the Wilson score interval consists of replacing the sample size $n$ with the effective sample size $n_*$. This is an extremely simple adjustment, and it gives a confidence interval that is valid for any population size $N \geqslant n$ (finite or infinite). It is easy to confirm that this reduces to the standard Wilson score interval when $N = \infty$ (giving $n_* = n$). It is also useful to note that when you take a full census of the population with $N=n$ the effective sample size diverges ($n_* \rightarrow \infty$ as $N \rightarrow n$) and the confidence interval reduces to the single point $\text{CI}_n(1-\alpha) = [\hat{\theta}_n]$, just as we would expect.
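A minimal Python sketch of this adjustment follows, assuming simple random sampling without replacement. The function name `wilson_interval_finite` and its arguments are hypothetical (this is not the interface of any existing package), and the two limiting cases are handled explicitly because the formula for $n_*$ is not directly computable when $N = \infty$ or $N = n$.

```python
from math import inf, isinf, sqrt
from statistics import NormalDist


def wilson_interval_finite(successes, n, N=inf, alpha=0.05):
    """Wilson score interval for the proportion in a population of size N (sketch)."""
    if n <= 0 or not 0 <= successes <= n or N < n:
        raise ValueError("need n > 0, 0 <= successes <= n and N >= n")
    p_hat = successes / n
    if N == n:
        # Full census: n_* is infinite and the interval is the single point p_hat.
        return p_hat, p_hat
    # Effective sample size; it equals n for an infinite population.
    n_star = n if isinf(N) else n * (N - 1) / (N - n)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # chi_{1,alpha}
    chi2 = z * z                             # chi^2_{1,alpha}
    centre = (n_star * p_hat + chi2 / 2) / (n_star + chi2)
    half_width = z / (n_star + chi2) * sqrt(n_star * p_hat * (1 - p_hat) + chi2 / 4)
    return centre - half_width, centre + half_width


# Hypothetical data: 45 positive outcomes in a sample of 100.
for population_size in (inf, 1000, 200, 100):
    lower, upper = wilson_interval_finite(45, 100, N=population_size)
    print(f"N = {population_size}: [{lower:.4f}, {upper:.4f}]")
```

The loop shows the interval narrowing as the sample covers a larger share of the population, ending at the single point $\hat{\theta}_n$ for a census.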


CI for the unsampled proportion in a finite population: A useful variant of this problem occurs when you are interested in making an inference about the proportion of positive outcomes in the unsampled part of a finite population, which is $\theta_{n:N} \equiv \sum_{i=n+1}^N X_i/(N-n)$. In this case, we can use the pivotal quantity:

$$n_{**} \cdot \frac{(\hat{\theta}_n-\theta_{n:N})^2}{\theta_{n:N} (1-\theta_{n:N})} \overset{\text{Approx}}{\sim} \text{ChiSq}(1) \quad \quad \quad n_{**} \equiv n \cdot \frac{N-n}{N-1},$$

which uses the alternative value $n_{**}$ for the effective sample size. The remaining calculations are the same as above, except that $n_{**}$ replaces $n_{*}$ in the formulae, giving the confidence interval:

$$\text{CI}_{n:N}(1-\alpha) = \Bigg[ \frac{n_{**} \hat{\theta}_n + \tfrac{1}{2} \chi_{1,\alpha}^2}{n_{**} + \chi_{1,\alpha}^2} \pm \frac{\chi_{1,\alpha}}{n_{**} + \chi_{1,\alpha}^2} \cdot \sqrt{n_{**} \hat{\theta}_n (1-\hat{\theta}_n) + \tfrac{1}{4} \chi_{1,\alpha}^2} \Bigg].$$

It is again easy to confirm that this reduces to the standard Wilson score interval when $N = \infty$ (giving $n_{**} = n$). It is also useful to note that when you take a full census of the population with $N=n$ you get $n_{**} = 0$ and the confidence interval reduces to the vacuous interval $\text{CI}_{n:n}(1-\alpha) = [0,1]$, just as we would expect.
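The unsampled-part variant can be sketched in the same way, using the effective sample size $n_{**}$ exactly as defined above. The function name `wilson_interval_unsampled` and its arguments are hypothetical, and the census case is returned explicitly as $[0,1]$ to avoid floating-point rounding at the endpoints.

```python
from math import inf, isinf, sqrt
from statistics import NormalDist


def wilson_interval_unsampled(successes, n, N=inf, alpha=0.05):
    """Wilson score interval for the proportion in the unsampled part (sketch)."""
    if n <= 0 or not 0 <= successes <= n or N < n:
        raise ValueError("need n > 0, 0 <= successes <= n and N >= n")
    if N == n:
        # Full census: nothing is left unsampled, n_** = 0, interval is vacuous.
        return 0.0, 1.0
    p_hat = successes / n
    # Effective sample size n_** as defined in the text; equals n when N is infinite.
    n_eff = n if isinf(N) else n * (N - n) / (N - 1)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # chi_{1,alpha}
    chi2 = z * z                             # chi^2_{1,alpha}
    centre = (n_eff * p_hat + chi2 / 2) / (n_eff + chi2)
    half_width = z / (n_eff + chi2) * sqrt(n_eff * p_hat * (1 - p_hat) + chi2 / 4)
    return centre - half_width, centre + half_width


# Hypothetical data: 45 positive outcomes in a sample of 100.
for population_size in (inf, 1000, 200, 100):
    lower, upper = wilson_interval_unsampled(45, 100, N=population_size)
    print(f"N = {population_size}: [{lower:.4f}, {upper:.4f}]")
```

In contrast to the interval for $\theta_N$, this interval widens as the sample covers a larger share of the population, because fewer unsampled values remain to average over.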


Implementation in R: These generalised confidence intervals are implemented in the CONF.prop function in the stat.extend package. The function implements the Wilson score intervals in the form shown above. By default the function uses the standard interval for the proportion parameter for an infinite population. However, the function also allows specification of a population size N and a logical value unsampled to give the above interval forms.
