Solved – How to calculate confidence interval when only a part of the samples are valid

confidence interval, non-response, sample-size, sampling

I will simplify our problem in this way. Say there are 100,000 cases in total to examine. Due to the time limitation, we randomly selected 2,000 of them. Then we found 1,000 of them are invalid, so we have only 1,000 valid cases left. Finally we categorize these 1,000 valid cases into 2 categories, A and B; they have 300 and 700 cases respectively.

We want to calculate a confidence interval in order to check the statistical significance of the results. In other words, if the result shows that 30% of the cases are in category A, how trustworthy is this percentage as a statement about the whole population? We used this website, http://www.surveysystem.com/sscalc.htm, to calculate the confidence interval, so the percentages come out as $30\pm 3.7\%$ and $70\pm 3.7\%$.

So there are two ways of deciding the population and the sample size.

(1) Population is 100,000 (real population) and sample size is 2,000;

(2) Population is 50,000 (estimated) and sample size is 1,000.

I think we should use option (2), because we actually found roughly only 50% of the original cases are valid cases. But this ratio is actually estimated by the 2,000 cases we sampled. How does this estimation affect the confidence of the result?

Would somebody recommend other ways to check the statistical significance of the result in our case?

Thanks for your help!

Best Answer

From a response to a comment, we can adopt an urn model. The urn contains 100,000 balls representing all cases. An unknown number of these are black ("invalid"); they are of no interest. We are interested solely in the non-black balls in the urn. Of those, some are of color "A" and others of color "B". The main research question appears to be "what proportion of the balls of interest are A's?"

This urn model says option (2) is the one to use.

A simple random sample (without replacement) of 2,000 balls from this urn yielded 1,000 black balls, 300 A's, and 700 B's, for n = 1,000 A's & B's. The rest is routine. In particular, the distribution of A's (conditional on a non-black ball being drawn) is Binomial(1000, p). A standard estimate of p is #A's / (Total A's & B's) = 30%. The estimated variance of a single draw is p(1-p) = 0.21, whence the variance of the estimated proportion of A's equals p(1-p)/n = 0.00021. Its square root, 1.45%, is the standard error of the estimate of p. Because the numbers of A's and B's are large, yet are small compared to the expected number of non-black balls (about 50,000), it is appropriate to use normal-theory confidence intervals and to ignore the correction for sampling without replacement. (The correction shrinks the confidence interval to 0.99 times its width.) A 99% two-sided confidence interval therefore extends 2.58 * 1.45% = 3.73% to either side of the estimated proportions. E.g., a confidence interval for the proportion of A's (out of all the A's and B's in the urn) extends from 26.27% to 33.73%.
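The arithmetic in this paragraph can be reproduced in R. This is only a minimal sketch: the variable names are illustrative and not part of the original answer, and the figure of 50,000 valid cases used for the correction factor is itself an estimate.

```r
# Observed counts among the valid (non-black) cases in the sample
n_A <- 300
n_B <- 700
n   <- n_A + n_B                          # 1000 valid cases

p_hat <- n_A / n                          # estimated proportion of A's
se    <- sqrt(p_hat * (1 - p_hat) / n)    # standard error of p_hat

z    <- qnorm(0.995)                      # two-sided 99% normal quantile
half <- z * se                            # half-width of the interval
ci   <- c(lower = p_hat - half, upper = p_hat + half)

# Finite population correction, ASSUMING about 50,000 valid cases (an estimate)
N_valid <- 50000
fpc     <- sqrt((N_valid - n) / (N_valid - 1))

print(round(100 * ci, 2))                 # interval in percent
print(round(fpc, 3))                      # factor by which the correction shrinks the width
```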

If you are uncomfortable using conditional probabilities (which is at the root of this analysis), you can estimate the contents of the urn (i.e., total numbers of black balls, A's, and B's) using the multinomial distribution. You will get exactly the same results, because in the end you care only about the proportion of A's relative to the numbers of A's and B's, so all estimates involving the number of black balls never enter the calculation.

Another way to get some intuition is to recognize that (except for the tiny correction term being neglected here) the size of the confidence interval depends only on the observed numbers of A's and B's and not on the number of balls in the urn. That's why there's no concern here about whether the "population" is 50,000 or 100,000.
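To see how little the assumed population size matters, one can compute the half-width with the finite population correction under both candidate populations. The helper `half_width` below is a hypothetical name introduced for illustration; it is a sketch under the normal approximation, not a general-purpose routine.

```r
# Half-width of a confidence interval for the proportion of A's, with the
# finite population correction applied for a given number of valid cases N.
half_width <- function(p_hat, n, N, conf = 0.99) {
  z   <- qnorm(1 - (1 - conf) / 2)          # two-sided normal quantile
  se  <- sqrt(p_hat * (1 - p_hat) / n)      # standard error ignoring N
  fpc <- sqrt((N - n) / (N - 1))            # finite population correction
  z * se * fpc
}

hw_50k  <- half_width(0.30, 1000, 50000)    # population of valid cases: 50,000
hw_100k <- half_width(0.30, 1000, 100000)   # population of valid cases: 100,000

print(round(100 * c(hw_50k, hw_100k), 2))   # half-widths in percentage points
```

The two half-widths differ by well under a tenth of a percentage point, which is why the choice between 50,000 and 100,000 is of no practical concern here.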

An auxiliary research question seems to be to estimate the total number of A's and B's in the urn. For this purpose the urn contains only two kinds of balls, black ones and non-black ones, and we want to estimate the number of non-black balls. This is a standard binomial sampling situation. Without more ado, the estimated number of non-black balls equals 100,000 * (1000/2000) = 50,000 and the estimated proportion is 1/2, with standard error $\sqrt{(1/2)(1 - 1/2)/2000}$ = 1.1%. A 99% two-sided confidence interval for the proportion therefore extends about 2.88% to either side of 1/2, and multiplying by the 100,000 balls in the urn gives a confidence interval for the estimate of 50,000 from about 47,120 to 52,880.
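The same calculation for the number of valid cases can be sketched in R. The variable names are illustrative; the interval for the count is obtained by scaling the interval for the proportion by the 100,000 balls in the urn, and the small correction for sampling without replacement is again ignored.

```r
N_total  <- 100000      # all cases in the urn
n_sample <- 2000        # cases drawn
n_valid  <- 1000        # non-black balls found in the sample

q_hat <- n_valid / n_sample                      # estimated proportion of valid cases
se_q  <- sqrt(q_hat * (1 - q_hat) / n_sample)    # standard error of q_hat
z     <- qnorm(0.995)                            # two-sided 99% normal quantile

ci_prop  <- q_hat + c(-1, 1) * z * se_q          # interval for the proportion
ci_count <- N_total * ci_prop                    # interval for the number of valid cases

print(round(se_q, 4))
print(round(ci_count))
```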

Related Solutions

Solved – Calculating necessary sample size using bootstrap

Ok, so this answer might not be exactly what you were after based on the detail of your question, but I stumbled across your question based on just the title and so this might help other people who also come across it in a similar fashion.

The only way I know of determining sample size using a bootstrap is via a power analysis approach. That is, you:

State the null hypothesis and alternative hypothesis
State the alpha level (typically 5%)
If necessary shift the pilot study data so that you know the null hypothesis is false
Resample with replacement from the pilot study
Perform the test on this sample and record the result
Repeat the previous two steps 1000 or so times to build up a probability distribution
Count how many times the null hypothesis is rejected

With many possible "variations on a theme of..."

And that gives you the statistical power (for that sample size and that particular test), because the definition of statistical power is "probability that the test will reject the null hypothesis when the alternative hypothesis is true". So you can then vary the sample size until you achieve the desired power.

Here's an approach in R that I did based on this paper, Sample Size / Power Considerations, by Elizabeth Colantuoni.

I had two groups of non-normal, non-parametric data. A pilot study of each showed them to have differing medians and a Mann Whitney Wilcoxon test rejected the null hypothesis that they were the same, but I wanted to determine the sample size required so I could say this for "sure". Since the test already rejected the null hypothesis on the pilot data I did not see any need to shift or manipulate the data to ensure the alternative hypothesis was true.

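# Estimate power by bootstrap: resample both pilot groups at the given size,
# run the test, and return the fraction of resamples that reject at the 5% level.
power <- function(group1.pilot, group2.pilot, reps=1000, size=10) {
    results <- sapply(1:reps, function(r) {
        group1.resample <- sample(group1.pilot, size=size, replace=TRUE)
        group2.resample <- sample(group2.pilot, size=size, replace=TRUE)
        test <- wilcox.test(group1.resample, group2.resample, paired=FALSE)
        test$p.value
    })
    # Proportion of resamples in which the null hypothesis is rejected
    sum(results<0.05)/reps
}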

#Find power for a sample size of 100
power(data1, data2, reps=1000, size=100)
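To vary the sample size until the desired power is reached, one option is to evaluate the bootstrap power over a grid of candidate sizes. The sketch below is self-contained and makes several assumptions that are not in the original answer: the function name `boot_power`, the simulated pilot data, the grid of sizes, and the target power of 0.8 are all chosen for illustration only.

```r
# Bootstrap power of the Mann-Whitney-Wilcoxon test at a given per-group size
boot_power <- function(g1, g2, size, reps = 200, alpha = 0.05) {
  p_values <- replicate(reps, {
    x <- sample(g1, size = size, replace = TRUE)
    y <- sample(g2, size = size, replace = TRUE)
    # exact = FALSE avoids warnings about ties, which resampling usually creates
    wilcox.test(x, y, exact = FALSE)$p.value
  })
  mean(p_values < alpha)    # fraction of resamples that reject the null
}

# HYPOTHETICAL pilot data (illustration only, not from the original study)
set.seed(1)
pilot1 <- rexp(20, rate = 1)
pilot2 <- rexp(20, rate = 0.5)

# Evaluate a grid of sizes and pick the first one reaching the target power
sizes  <- c(10, 20, 40, 80, 160)
target <- 0.8
powers <- sapply(sizes, function(s) boot_power(pilot1, pilot2, size = s))
needed <- sizes[which(powers >= target)[1]]   # NA if no size reaches the target

print(data.frame(size = sizes, power = powers))
print(needed)
```

The estimated power is itself noisy, so with a small `reps` the chosen size can vary from run to run; increasing `reps` trades run time for stability.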

Necessary disclaimer: I'm not a statistician and I'm still learning about bootstrapping so feedback, corrections and pointing and laughing are welcome.

Solved – Confidence interval for the population mean

The 95% confidence interval $\bar X \pm 1.96\frac{\sigma}{\sqrt{n}}$ for unknown $\mu$ is correct for normal data when the population standard deviation $\sigma$ is known. It is approximately correct for moderately large $n,$ when $\sigma$ is estimated by the sample standard deviation $S.$

However, you are correct to doubt this so-called 'z-interval' when $\sigma$ is estimated by $S$ and the sample size is small. Then the exact 95% CI for $\mu$ is given by $\bar X \pm t^*\frac{S}{\sqrt{n}},$ where $\pm t^*$ cut probability $0.025 = 2.5\%$ from the upper and lower tails, respectively, of Student's t distribution with $\nu = n-1$ degrees of freedom. [For example, if $n = 10,$ then $t^* = 2.262;$ computation in R.]

qt(.975, 9)
[1] 2.262157
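As a small worked example, the t-interval can be computed by hand and checked against R's built-in `t.test`. The ten data values below are hypothetical and serve only to illustrate the formula.

```r
# HYPOTHETICAL sample of n = 10 observations (illustration only)
x <- c(9.8, 10.4, 10.1, 9.5, 10.9, 10.2, 9.7, 10.6, 10.0, 10.3)
n <- length(x)

t_star    <- qt(0.975, df = n - 1)                        # 2.262157 for n = 10
ci_manual <- mean(x) + c(-1, 1) * t_star * sd(x) / sqrt(n)

# t.test() computes the same 95% interval
ci_builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(ci_manual)
print(ci_builtin)
```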

At the 95% level in particular, $t^* \approx 2$ when $n \ge 30,$ so the z-interval gives pretty good results for $n \ge 30.$

qnorm(.975);  qt(.975,33)
[1] 1.959964
[1] 2.034515

More generally, for confidence level $100(1 - \alpha)\%,$ there are other sample sizes $n$ at which $t^*$ is sufficiently near the $z^*$ that cuts probability $\alpha/2$ from the upper tail of the (symmetrical) standard normal distribution.

[For example, depending on one's degree of fussiness, something like $n=400$ might be large enough for a 98% CI; something like $n=12$ might be large enough for an 80% CI. But it is simpler just to use 't-intervals' whenever $\sigma$ is unknown and estimated by $S.$]

qnorm(.99);  qt(.99,400)
[1] 2.326348
[1] 2.335706

qnorm(.90);  qt(.90,11)
[1] 1.281552
[1] 1.36343

Note: You may sometimes see $n = 30$ given as a large enough sample size to pretend that the z-interval may be used even for non-normal data, and this can be very bad advice depending on the actual population distribution.
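One way to check this warning is to simulate the actual coverage of the z-interval at $n = 30$ for a skewed population. The lognormal population, the seed, and the number of repetitions below are assumptions chosen for illustration; other populations will behave differently.

```r
set.seed(42)
reps <- 2000
n    <- 30
mu   <- exp(0.5)      # true mean of a lognormal(meanlog = 0, sdlog = 1) population

covered <- replicate(reps, {
  x  <- rlnorm(n, meanlog = 0, sdlog = 1)
  hw <- qnorm(0.975) * sd(x) / sqrt(n)    # z-interval half-width, S in place of sigma
  abs(mean(x) - mu) <= hw                 # does the interval contain the true mean?
})

coverage <- mean(covered)
print(coverage)       # compare with the nominal 0.95
```

For a population this skewed, one would expect the simulated coverage to fall noticeably short of the nominal 95%.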
