Solved – Confidence interval for a proportion estimated through stratified sampling

binomial distribution, confidence interval, sampling, stratification

When estimating a confidence interval for a binomial proportion $p$ with $\hat{p}$ near 0 or 1, one has to use something other than the Wald interval to get a reasonable estimate (see, for instance, Brown, Cai, DasGupta (2001)). But what if my sample is obtained through stratified sampling, and $\hat{p}$ in each stratum is equal to or near zero? It seems to me that adjusting in each stratum is overkill. What is the right process to obtain a confidence interval for $p$ in this scenario?

Background/Context: I have about 250 records of individuals and I would like to investigate the prevalence of a certain illness among these individuals. Examining the records is long and costly, so I decided to take a sample and examine only the sampled records. I could stratify the 250 records into 40 strata, each consisting of records that I believe would be similar as far as the occurrence of the illness is concerned (for instance, the individuals lived in the same geographical area). The strata have as few as 2 and as many as 15 records in them (many have 2 records). From each stratum, I picked one record at random and examined it. None of the records examined indicated the occurrence of the illness. What is my confidence interval for $p$?

Best Answer

I have no real answer for you, only some thoughts. You are unlucky in that the illness is so rare. I'll first note that this design would have caused trouble even if the illness were common. For example, the SE formula for the weighted prevalence requires $n_h > 1$ observations per stratum (Cochran, 1977, Chapter 5).

You ask if it is okay to ignore the stratification and apply a formula for an exact CI. There's no real justification for doing so: the theory behind that formula assumes simple random sampling (SRS). In that design every observation has the same probability of selection. In your design, a stratified sample, the probabilities range from 1/2 to 1/15, or, more formally, $1/N_h$, where $N_h$ is the size of stratum $h$. The SRS CI endpoint will be biased if you over-sampled or under-sampled strata with higher expected prevalences.

You can, however, check on the direction of this bias. You have some knowledge of risk predictors for the illness: the characteristics you used to form the strata. As best you can, form $G$ groups of strata with different levels of risk and rank the groups from lowest expected risk to highest. Then plot the individual $N_h$ and the group mean $N_h$ against group number. A positive trend (average $N_h$ increasing with group number) will indicate that you under-sampled the higher-risk groups. This might partly account for the failure to see any cases. A negative trend would show that you over-sampled the high-risk groups. In that case the failure to see cases is partly due to bad luck and to taking too small a sample.
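The numerical part of that check can be sketched in a few lines of Python. This is only an illustration: the `(risk_group, N_h)` pairs are invented, the helper name `group_mean_sizes` is hypothetical, and the grouping itself still has to come from your own judgment about risk.

```python
from collections import defaultdict
from statistics import mean

def group_mean_sizes(strata):
    """Map each risk group to the mean stratum size N_h, ordered by group number.

    `strata` is an iterable of (risk_group, N_h) pairs, where risk_group 1 is
    the lowest expected risk.
    """
    sizes = defaultdict(list)
    for group, N_h in strata:
        sizes[group].append(N_h)
    return {g: mean(v) for g, v in sorted(sizes.items())}

# Hypothetical data: (risk group, stratum size). Not taken from the question.
strata = [(1, 2), (1, 3), (1, 2), (2, 5), (2, 6), (3, 12), (3, 15)]

means = group_mean_sizes(strata)
vals = list(means.values())
# True when the mean stratum size rises with every step up in risk group,
# i.e. higher-risk strata had lower selection probabilities 1/N_h.
is_increasing = all(a < b for a, b in zip(vals, vals[1:]))

for g, m in means.items():
    print(f"group {g}: mean N_h = {m:.2f}")
print("under-sampled higher-risk groups:", is_increasing)
```

With one record drawn per stratum, a larger $N_h$ means a smaller selection probability, which is why a rising mean points to under-sampling of the higher-risk groups.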

Theory for Simple Random Sampling without replacement

Let the unknown number of patients with the illness be $D$, assumed $> 0$, and let $N$ be the population size; then the prevalence of illness $P$ is

$$ P = \frac{D}{N} $$

Note that $D$ can take only integer values.

Suppose the number of patients with the condition observed in the sample is $T$. Then $T$ has a hypergeometric distribution, not a binomial distribution, because the sample is drawn without replacement from a finite population (Cochran, 1977, p. 55). (This accounts for the appearance of the finite population correction for variances in sampling without replacement.)

The parameters for the hypergeometric distribution are $N$, the population size, $D$ the number of patients with the illness in the population, and $n$, the sample size. The probability that $T = d$ is:

$$ \Pr(T = d \mid N, n, D) = \dfrac{\binom{D}{d} \binom{N-D}{n-d}}{\binom{N}{n}} $$

Confidence interval for SRS without replacement

I'll demonstrate the CI that would have been valid for a simple random sample with population size $N$, $D$ events in the population, sample size $n$, and $d$ events in the sample. The one-sided $1-\alpha$ upper endpoint for $D$ is the largest value of $D$ for which

$$ P(T \leq d \> \vert \> N, n, D) \geq \alpha $$

where $T$ has a hypergeometric distribution with parameters $(N, n, D)$. This CI is based on inverting a hypothesis test about $D$: any larger $D$ would make seeing $d$ or fewer cases too improbable and is rejected. See, e.g., Blaker, 2000.

With $d = 0$, this is

$$ P(T = 0 \> \vert \> N, n, D) \geq \alpha $$

In your study, $N=250$, $n=40$, and $d=0$. Suppose these data had been generated by an SRS. I used Stata's hypergeometric function to generate a one-sided 80% CI. I chose 80% because in such a situation, my practice is to trade confidence for a smaller interval.

Under SRS, the upper bound of the one-sided 80% (actually 79.8%) hypergeometric CI for $D$ would be $D_u = 9$, which corresponds to a prevalence of $\hat{P} = 9/250 = 3.6\%$. The corresponding one-sided binomial interval, which ignores the finite population, would be $\hat{P} = 3.9\%$. You can see that the hypergeometric interval is shorter. Both intervals are likely to be conservative, with the true probability of coverage greater than the nominal 80% (Blaker, 2000).
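As a rough check, the SRS numbers can be recomputed with the Python standard library. This is a minimal sketch, not the Stata code behind the figures above. The helper names `hypergeom_cdf` and `upper_bound_D` are made up for illustration, and the endpoint is taken to be the largest $D$ whose lower-tail probability $P(T \leq d)$ is still at least $\alpha$.

```python
from math import comb

def hypergeom_cdf(d, N, n, D):
    """P(T <= d) for T ~ Hypergeometric(N, n, D), summed with exact integers."""
    total = sum(comb(D, k) * comb(N - D, n - k) for k in range(d + 1))
    return total / comb(N, n)

def upper_bound_D(d, N, n, alpha):
    """Largest D with P(T <= d | N, n, D) >= alpha (one-sided upper endpoint)."""
    best = d
    # D must leave at least n - d disease-free records in the population.
    for D in range(d, N - (n - d) + 1):
        if hypergeom_cdf(d, N, n, D) >= alpha:
            best = D
    return best

N, n, d, alpha = 250, 40, 0, 0.20

D_u = upper_bound_D(d, N, n, alpha)
p_hyper = D_u / N                  # hypergeometric upper endpoint for P
p_binom = 1 - alpha ** (1 / n)     # binomial endpoint: solves (1 - p)^n = alpha

print("D_u =", D_u)
print("P(T = 0 | D = D_u) =", round(hypergeom_cdf(0, N, n, D_u), 4))
print("hypergeometric endpoint =", round(p_hyper, 3))
print("binomial endpoint =", round(p_binom, 3))
```

The tail probability at $D_u$ is what lies behind the "actually 79.8%" remark: it is slightly above 0.20, because $D$ can only move in whole steps.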

Actual distribution: weighted sum of Bernoulli variables

Let $h$ index strata. In stratum $h$, let $N_h$ be the number of records, $n_h$ be the size of the sample (= 1 here), $d_h$ be the number with illness in the sample (= 0 or 1 here), $D_h$ be the unknown number of patients in the population with illness, and $P_h = D_h/N_h$ be the unknown prevalence in the stratum.

The sum of the $D_h$ is $D$, the unknown number of ill patients in the population. The estimated prevalence is

\begin{align} \hat{P} & = \frac{\hat{D}}{N} \end{align}

with

\begin{align} \hat{D} = \sum_h \dfrac{N_h}{n_h} d_h = \sum_h N_h d_h \end{align}

With $n_h = 1$, the distribution of $d_h$ is that of a Bernoulli 0-1 random variable with probability $P_h = D_h/N_h$. Thus $\hat{D}$ is a weighted sum of these Bernoulli variables, with weights $N_h$.

I don't know how to do a hypothesis test for $D$ in this situation, so I don't have a test to invert to get a confidence interval. The problem is that there is no single probability distribution for $\hat{D}$ for each possible value $D_0$; there is a different distribution for each compatible set of the $D_h$ for which $\sum_h D_h = D_0$.
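A toy calculation shows how much the answer can depend on where the cases sit. The sketch below is illustrative only: the three stratum sizes and the case allocations are invented, and `prob_zero_cases` is a hypothetical helper. With one record drawn per stratum, the chance of seeing no cases at all is $\prod_h (1 - D_h/N_h)$, and it differs sharply across allocations that share the same total $D_0 = 2$.

```python
from fractions import Fraction
from math import prod

def prob_zero_cases(stratum_sizes, stratum_cases):
    """P(every sampled record is disease-free) with one record drawn per stratum."""
    return prod(
        Fraction(N_h - D_h, N_h)
        for N_h, D_h in zip(stratum_sizes, stratum_cases)
    )

# Hypothetical population of three strata; every allocation has D_0 = 2 cases.
sizes = [2, 5, 15]
config_a = [2, 0, 0]   # both cases in the smallest stratum
config_b = [0, 0, 2]   # both cases in the largest stratum
config_c = [1, 1, 0]   # one case in each of the two smaller strata

for name, cfg in [("a", config_a), ("b", config_b), ("c", config_c)]:
    print(name, prob_zero_cases(sizes, cfg))
```

Under allocation (a) a zero-case sample is impossible, while under (b) it is the most likely outcome, so the observed data cannot be scored against $D_0$ alone.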

Other Designs

Confronted with a population with a rare outcome, there are not many good choices. A larger sample would have helped. For a rare outcome such as yours, I would have tried inverse sampling: sample randomly until one case is found, so that the number of trials is the random variable. There are CI formulas for the case of independent samples (see Zou, 2010), but I haven't found one for the case of without-replacement sampling, where the relevant distribution is the "negative hypergeometric", which is the same as the beta-binomial distribution.

There is a theory of optimal design, and I state it for background. According to the theory, the selection probability $\pi$ for an observation should increase with the expected "size" of the observation, in this case its risk of disease. For stratified sampling (Cochran, 1977, Chapter 5), you'd form a small number of strata in which the observations have similar expected very low risks $P_h$, then make the selection fraction $n_h/N_h \propto \sqrt{P_h(1-P_h)}$, which is very close to $\sqrt{P_h}$ for small risks. It's unlikely that you'd be able to quantify actual risks, but you get the idea: higher-risk patients are selected with higher probabilities.

A practical tactic is to identify a group of $N_1$ patients with risks so low that you are very sure there are no cases among them, and to omit them from the sampling. This leaves $N_2 = N - N_1$ people to sample from. If the upper CI endpoint from inverse or random sampling of these $N_2$ people is $\hat{P_2}$, the corresponding upper endpoint for the prevalence in the whole population is $\hat{P} = \dfrac{N_2}{N} \hat{P_2}$.
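To make the scaling concrete, here is a worked example with made-up numbers (both $N_1 = 100$ and the 6% endpoint are assumptions for illustration, not values from your data). With $N = 250$, $N_2 = 150$ and $\hat{P_2} = 0.06$:

$$ \hat{P} = \frac{N_2}{N} \hat{P_2} = \frac{150}{250} \times 0.06 = 0.036 $$

So an upper endpoint of 6% in the sampled group would translate into 3.6% for the whole population, provided the excluded group really contains no cases.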

References

Blaker, H. 2000. Confidence curves and improved exact confidence intervals for discrete distributions. Canadian Journal of Statistics 28, no. 4: 783-798.

Cochran, William G. 1977. Sampling Techniques. New York: Wiley.

Zou, G.Y. 2010. Confidence interval estimation under inverse sampling. Computational Statistics & Data Analysis 54, no. 1: 55-64.