Confidence Interval – How to Determine the Correct Sample Size for Calculating CI in a Subset

binomial distribution, confidence interval, survey

Suppose I have a study of 3031 people that obtains responses to various questions (95% CL). One of the questions (Q1a) gets a yes response from 616 of the people. Of those 616, only 34 have QualityB (the remaining 582 don't); QualityB is defined only for people who answered yes to Q1a.

When determining the margin of error for QualityB (Confidence Interval) using this online calculator, do I use 34 or 616 for the sample size? I think I should use 616 for the sample size, 3031 for population size, and 5% for percentage, but I am not sure.

  • Using 616 as the sample size, result is +/- 1.54%.
  • Using 34 as the sample size, result is +/- 7.29%.
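The two figures above can be reproduced with a short calculation. This is only a sketch: it assumes (as a guess about the calculator, not something confirmed here) that the calculator uses the normal approximation with a 95% multiplier of 1.96 and a finite population correction. The helper name `margin_of_error` is hypothetical.

```r
# Margin of error for a proportion under the normal approximation,
# with a finite population correction (assumed, not confirmed, to be
# what the online calculator does).
margin_of_error <- function(n, N, p, z = 1.96) {
  se  <- sqrt(p * (1 - p) / n)     # standard error of the proportion
  fpc <- sqrt((N - n) / (N - 1))   # finite population correction
  z * se * fpc
}

# Sample sizes 616 and 34, population 3031, percentage 5%
moe <- 100 * margin_of_error(n = c(616, 34), N = 3031, p = 0.05)
print(round(moe, 2))  # margins of error in percent: 1.54 and 7.29
```

Under these assumptions the output matches the two quoted values, so the difference between them comes entirely from the choice of sample size.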

Best Answer

As I interpret the question, this problem can easily be resolved through some careful reasoning about what is going on and what the survey objectives are. We can even do some simple mental calculations as a check on the confidence interval produced by the software.

Consider this model of the survey. In a population, members have the following attributes: their answer to Q1a ("yes" or "no") and their QualityB (present or absent). However, a QualityB value is available and meaningful only for those who answered "yes". One purpose of the survey is to estimate the proportion of "yes" answerers who have QualityB present. To this end, the survey has selected 3031 people independently and randomly from the population.

If this model reasonably approximates the survey and its objective, then notice that the randomization procedure by which all 3031 people were selected constitutes a fortiori a random procedure to select among the "yes" answerers. However (unlike specifying the sample size of 3031, which is usually determined by the investigator), the number of "yes" answerers was not determined in advance: it, too, is a random quantity.

Nevertheless, in part because the subsample size of 616 is so large, it is a reasonable approximation to analyze it as if it were a random sample of 616 people chosen from just the "yes" answerers in the population. (As a partial justification of why we can consider 616 "large," by using the approximate Binomial theory calculations below, one can figure out that a comparable survey would likely have included between 579 and 653 "yes" answerers; this amount of variation would not change the calculation of the confidence limits much at all.) Accordingly, the analysis of the 34 people with QualityB present can proceed as usual.
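The range of 579 to 653 "yes" answerers can be checked with a few lines of R. This sketch assumes the same multiplier of 1.65 that is used for the confidence limits in this answer; the variable names are illustrative.

```r
# Normal approximation to the Binomial count of "yes" answerers.
n     <- 3031          # total sample size
p_yes <- 616 / 3031    # observed proportion answering "yes"
z     <- 1.65          # assumed multiplier (about 90% two-sided)

sd_count  <- sqrt(n * p_yes * (1 - p_yes))   # Binomial SD of the count
yes_range <- 616 + c(-1, 1) * z * sd_count   # likely range of the count
print(round(yes_range))  # 579 653
```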

The binomial theory is applicable: we estimate the proportion of QualityB people out of the "yes" answerers as $34/616$ = $5.5$% and we estimate the variance of an individual response as $(34/616)(1 - 34/616)$, for an estimated standard deviation of $0.22836$. Because the subsample size is $616$, the standard error of the proportion is $0.22836/\sqrt{616}$ = $0.92$%. So, just to see where this is headed, we could use a normal approximation as a rough check. This tells us to expect the confidence interval procedure to give us a range from about $5.5 - 1.65 \times 0.92$ = $4.0$% to $5.5 + 1.65 \times 0.92$ = $7.0$%. This range is very close to the quoted value of $5.5 \pm 1.54$%. (The multiplier of $1.65$ should give, approximately, a $90$% two-sided confidence interval.)
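The arithmetic in this paragraph is easy to reproduce. The following is a minimal sketch of the normal-approximation calculation, with variable names chosen only for illustration.

```r
# Normal-approximation check of the interval for 34 out of 616.
x <- 34                               # "yes" answerers with QualityB present
m <- 616                              # all "yes" answerers (subsample size)

p_hat  <- x / m                       # estimated proportion
sd_hat <- sqrt(p_hat * (1 - p_hat))   # estimated SD of a single response
se     <- sd_hat / sqrt(m)            # standard error of the proportion
ci     <- p_hat + c(-1, 1) * 1.65 * se  # approximate 90% two-sided interval

print(round(100 * c(estimate = p_hat, se = se), 2))  # in percent
print(round(100 * ci, 1))                            # in percent: 4 and 7
```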

We conclude that the proportion of people with QualityB among all "yes" answerers in the population is likely to be between $4$% and $7$%. To deduce this, we have used a procedure that will mislead us (by the luck of the draw) at most $10$% of the time it is appropriately applied; that's where our "confidence" comes from.


Edit

Because some questions about the validity of this answer have been raised in comments, let's check. One way is to bootstrap the data to assess the bias. But before proceeding, let's recast the problem in a more concrete form.

Suppose, then, we are interested in the proportion of U.S. senior citizens (defined, say, as people who were age 55 or older on January 1, 2012 and were resident in the U.S. on that date) who have ever tried recreational drugs. To this end, we identify all resident adults and send out a questionnaire to 3031 randomly selected adults. On it are two questions, analogs of the Q1a and QualityB questions discussed earlier:

  1. What was your age on January 1, 2012?

  2. If you answered 55 or older to question 1, have you ever knowingly consumed a drug, for recreational purposes, that at the time either required a physician's prescription or was illegal to use or sell in the U.S.?

Miraculously (perhaps through incredibly diligent followup) you receive valid responses on all 3031 questionnaires. The data are:

  • 616 of the responses are ages 55 or older.

  • Of those 616, 34 answered "yes" to the second question.

What proportion should you estimate? Is there any valid way to estimate a proportion at all?

One form of the bootstrap studies this problem by adopting a synthetic population having exactly the same proportions observed in the data and recreating the experiment and its analysis many, many times, independently. Here is reproducible R code to do that for 100,000 independent trials, using the Binomial estimate recommended above:

trial <- function(n.trials, n=1, p1=1/2, p2=1/2) {
  x <- rmultinom(n.trials, n, c(p2,1-p2) %o% c(p1,1-p1))
  m <- x[1,]+x[2,] # Total who answer the second question
  mean <- x[1,]/m  # Proportion of "yeses" in the second question
  se <- sqrt(mean*(1-mean)/m)
  rbind(mean, se)  # Estimate and standard error of the estimate for each trial
}
set.seed(17)
sim <- trial(100000, 3031, 616/3031, 34/616)

The average estimate, mean(sim[1,]), is $0.0551537$: almost identical to the correct value of $34/616 \approx 0.0551948$ in the synthetic population. There is no appreciable bias.

How about the approximate confidence interval procedure? We can check in each of the 100,000 trials whether the confidence interval covered the true value of $34/616$ or not:

coverage.upper <- sim[2,] * 1.65 + sim[1,] > 34/616
coverage.lower <- -sim[2,] * 1.65 + sim[1,] < 34/616
(sum(coverage.upper) + sum(coverage.lower))/100000 - 1

The result, $0.89542$, is within one-half of one percent of the desired coverage of $0.90$: that's excellent, especially given the approximations that were made.
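The same check can be written more directly by testing, trial by trial, whether both limits bracket the true value. The sketch below is self-contained and is not the code used above: it draws the number of "yes" answerers and then the QualityB count in two Binomial stages (which is equivalent to the multinomial sampling), and it uses a smaller, arbitrarily chosen number of trials, so its result will differ slightly from the figure quoted.

```r
# Direct coverage check using a two-stage Binomial simulation.
set.seed(17)
n_trials <- 20000       # fewer trials than above; an arbitrary choice
p_true   <- 34 / 616    # true proportion in the synthetic population

m <- rbinom(n_trials, 3031, 616 / 3031)  # "yes" answerers in each trial
x <- rbinom(n_trials, m, p_true)         # of those, QualityB present

est <- x / m                             # estimated proportion per trial
se  <- sqrt(est * (1 - est) / m)         # its standard error

# A trial "covers" when the true value lies strictly inside the interval
covered  <- (est - 1.65 * se < p_true) & (p_true < est + 1.65 * se)
coverage <- mean(covered)
bias     <- mean(est) - p_true
print(c(coverage = coverage, bias = bias))
```

Counting the trials where both limits hold at once gives the same quantity as the sum-minus-one expression above, because every trial satisfies at least one of the two conditions.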

Given these (hypothetical) data we may legitimately conclude, then, that approximately $5.52$% of all U.S. senior citizens have used recreational drugs. With $90$% confidence that proportion is between $4$% and $7$%.
