How to calculate confidence interval when only a part of the samples are valid

confidence interval, non-response, sample-size, sampling

I will simplify our problem in this way. Say there are 100,000 cases in total to examine. Due to the time limitation, we randomly selected 2,000 of them. Then we found 1,000 of them are invalid, so we have only 1,000 valid cases left. Finally we categorize these 1,000 valid cases into 2 categories, A and B; they have 300 and 700 cases respectively.

We want to calculate a confidence interval in order to check the statistical significance of the results. In other words, if the result shows that 30% of the cases are in category A, how trustworthy is this percentage as a statement about the whole population? We used this website, http://www.surveysystem.com/sscalc.htm, to calculate the confidence interval, so the percentages will look like $30\pm 3.7\%$ and $70\pm 3.7\%$.

So there are two ways of deciding the population and the sample size.

(1) Population is 100,000 (real population) and sample size is 2,000;

(2) Population is 50,000 (estimated) and sample size is 1,000.

I think we should use option (2), because we actually found roughly only 50% of the original cases are valid cases. But this ratio is actually estimated by the 2,000 cases we sampled. How does this estimation affect the confidence of the result?

Would somebody recommend other ways to check the statistical significance of the result in our case?

Thanks for your help!

Best Answer

Based on a response to a comment, we can adopt an urn model. The urn contains 100,000 balls representing all cases. An unknown number of these are black ("invalid"); they are of no interest. We are interested solely in the non-black balls in the urn. Of those, some are of color "A" and others of color "B". The main research question appears to be "what proportion of the balls of interest are A's?"

This urn model says option (2) is the one to use.

A simple random sample (without replacement) of 2,000 balls from this urn yielded 1,000 black balls, 300 A's, and 700 B's, for n = 1,000 A's and B's. The rest is routine. In particular, the number of A's (conditional on the 1,000 non-black balls drawn) is approximately Binomial(1000, p). A standard estimate of p is #A's / (total A's and B's) = 300/1000 = 30%. The estimated variance of a single draw is p(1-p) = 0.21, whence the variance of the estimated proportion of A's equals p(1-p)/n = 0.00021. Its square root, 1.45%, is the standard error of the estimate of p. Because the numbers of A's and B's are large, yet are small compared to the expected number of non-black balls (about 50,000), it is appropriate to use normal-theory confidence intervals and to ignore the correction for sampling without replacement. (The correction would shrink the confidence interval to about 0.99 times its width.) A 99% two-sided confidence interval therefore extends 2.576 * 1.449% = 3.73% to either side of the estimated proportion. E.g., a confidence interval for the proportion of A's (out of all the A's and B's in the urn) extends from 26.27% to 33.73%.
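The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch of the normal-theory (Wald) interval described in the answer; the function name `proportion_ci` and its `level` parameter are illustrative choices, not part of the original post.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, level=0.99):
    """Normal-theory (Wald) interval for a proportion, ignoring the
    correction for sampling without replacement."""
    p_hat = successes / n                      # estimated proportion
    se = sqrt(p_hat * (1 - p_hat) / n)         # standard error of p_hat
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 2.576 for 99%
    half_width = z * se
    return p_hat, se, p_hat - half_width, p_hat + half_width


# 300 A's among the 1,000 valid (non-black) cases
p_hat, se, lower, upper = proportion_ci(300, 1000)
print(f"estimate {p_hat:.2%}, SE {se:.2%}, 99% CI {lower:.2%} to {upper:.2%}")
```

With 300 A's out of 1,000 valid cases this should print an estimate of 30.00%, a standard error of 1.45%, and an interval from 26.27% to 33.73%.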

If you are uncomfortable using conditional probabilities (which is at the root of this analysis), you can estimate the contents of the urn (i.e., total numbers of black balls, A's, and B's) using the multinomial distribution. You will get exactly the same results, because in the end you care only about the proportion of A's relative to the numbers of A's and B's, so all estimates involving the number of black balls never enter the calculation.
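One way to check this equivalence numerically is the first-order (delta-method) variance of the ratio A / (A + B) under a multinomial model for all 2,000 draws. The sketch below is an illustration under that assumption; the helper name `ratio_variance_multinomial` is hypothetical, and the delta method is a large-sample approximation rather than something the original answer spells out.

```python
def ratio_variance_multinomial(a, b, black):
    """Delta-method variance of a / (a + b) when (a, b, black) are
    multinomial counts from N = a + b + black draws."""
    N = a + b + black
    pi_a, pi_b = a / N, b / N          # estimated cell probabilities
    s = pi_a + pi_b                    # probability of a non-black ball
    # Gradient of g(pi_a, pi_b) = pi_a / (pi_a + pi_b)
    g_a = pi_b / s**2
    g_b = -pi_a / s**2
    # Variances and covariance of the sample proportions
    var_a = pi_a * (1 - pi_a) / N
    var_b = pi_b * (1 - pi_b) / N
    cov_ab = -pi_a * pi_b / N
    return g_a**2 * var_a + g_b**2 * var_b + 2 * g_a * g_b * cov_ab


multinomial_var = ratio_variance_multinomial(300, 700, 1000)
conditional_var = 0.3 * 0.7 / 1000     # p(1-p)/n from the conditional analysis
print(multinomial_var, conditional_var)
```

Both variances come out as 0.00021 (up to floating-point rounding): the black-ball count only enters through N and s, and those combine into N * s = n, the number of valid cases.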

Another way to get some intuition is to recognize that (except for the tiny correction term being neglected here) the size of the confidence interval depends only on the observed numbers of A's and B's and not on the number of balls in the urn. That's why there's no concern here about whether the "population" is 50,000 or 100,000.
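To see how little the population size matters, the following sketch applies the usual finite population correction, sqrt((N - n) / (N - 1)), to the 99% half-width for several assumed population sizes N. The function name `half_width_with_fpc` and the list of population sizes are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def half_width_with_fpc(p_hat, n, population, level=0.99):
    """Half-width of a normal-theory interval for a proportion, with the
    finite population correction for sampling without replacement."""
    se = sqrt(p_hat * (1 - p_hat) / n)
    fpc = sqrt((population - n) / (population - 1))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * se * fpc


# Same sample (30% A's among 1,000 valid cases), different assumed populations
widths = {N: half_width_with_fpc(0.3, 1000, N) for N in (50_000, 100_000, 10**9)}
for N, w in widths.items():
    print(f"population {N:>13,}: half-width {w:.3%}")
```

The half-width stays close to 3.7% in every case, which is why the choice between 50,000 and 100,000 as the "population" hardly affects the interval.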

An auxiliary research question seems to be to estimate the total number of A's and B's in the urn. For this purpose the urn contains only two kinds of balls, black ones and non-black ones, and we want to estimate the number of non-black balls. This is a standard binomial sampling situation. Without more ado, the estimated number of non-black balls equals 100,000 * (1000/2000) = 50,000 and the estimated proportion is 1/2, with standard error $\sqrt{(1/2)(1 - 1/2)/2000}$ = 1.1%. The proportion therefore has a 99% two-sided confidence interval of 50% ± 2.576 * 1.12% = 50% ± 2.88%. Scaling by the 100,000 balls in the urn, the estimate of 50,000 has a 99% two-sided confidence interval from about 47,120 to 52,880.
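The same normal-theory recipe can be sketched for the count of non-black balls. The interval is computed for the proportion first, and its half-width is then scaled by the urn size of 100,000. The name `count_ci` is an illustrative choice, and the small correction for sampling without replacement is again ignored.

```python
from math import sqrt
from statistics import NormalDist


def count_ci(population, hits, n, level=0.99):
    """Normal-theory interval for a population count, obtained by scaling
    the interval for the sample proportion by the population size."""
    p_hat = hits / n
    se = sqrt(p_hat * (1 - p_hat) / n)         # SE of the proportion
    z = NormalDist().inv_cdf(0.5 + level / 2)
    estimate = population * p_hat
    half_width = population * z * se           # scale to a count
    return estimate, se, estimate - half_width, estimate + half_width


# 1,000 non-black balls among the 2,000 sampled from an urn of 100,000
estimate, se, lower, upper = count_ci(100_000, 1000, 2000)
print(f"estimate {estimate:,.0f}, SE of proportion {se:.2%}, "
      f"99% CI {lower:,.0f} to {upper:,.0f}")
```

The standard error of the proportion is about 1.12%, so the half-width is about 2.88% of 100,000, i.e. roughly 2,880 balls on either side of 50,000.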
