Confidence Interval – How to Determine the Correct Sample Size for Calculating CI in a Subset

binomial distribution, confidence interval, survey

Suppose I have a study of 3031 people that obtains responses to various questions (95% CL). One of the questions (Q1a) gets a yes response from 616 of the people. Of those 616, only 34 have a QualityB (the remaining 582 don't) that is dependent on a yes answer to (Q1a).

When determining the margin of error for QualityB (Confidence Interval) using this online calculator, do I use 34 or 616 for the sample size? I think I should use 616 for the sample size, 3031 for population size, and 5% for percentage, but I am not sure.

Using 616 as the sample size, result is +/- 1.54%.
Using 34 as the sample size, result is +/- 7.29%.

Best Answer

As I interpret the question, this problem can easily be resolved through some careful reasoning about what is going on and what the survey objectives are. We can even do some simple mental calculations as a check on the confidence interval produced by the software.

Consider this model of the survey. In a population, members have the following attributes: their answer to Q1a ("yes" or "no") and their QualityB (present or absent). However, a QualityB value is available and meaningful only for those who answered "yes". One purpose of the survey is to estimate the proportion of "yes" answerers who have QualityB present. To this end, the survey has selected 3031 people independently and randomly from the population.

If this model reasonably approximates the survey and its objective, then notice that the randomization procedure by which all 3031 people were selected constitutes a fortiori a random procedure to select among the "yes" answerers. However (unlike specifying the sample size of 3031, which is usually determined by the investigator), the number of "yes" answerers was not determined in advance: it, too, is a random quantity.

Nevertheless, in part because the subsample size of 616 is so large, it is a reasonable approximation to analyze it as if it were a random sample of 616 people chosen from just the "yes" answerers in the population. (As a partial justification of why we can consider 616 "large," by using the approximate Binomial theory calculations below, one can figure out that a comparable survey would likely have included between 579 and 653 "yes" answerers; this amount of variation would not change the calculation of the confidence limits much at all.) Accordingly, the analysis of the 34 people with QualityB present can proceed as usual.
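The range quoted in the parenthesis above can be reproduced in a few lines. This is only a sketch: it assumes the number of "yes" answerers is Binomial with $n = 3031$ and a success probability estimated as $616/3031$, and it uses the same approximate 90% multiplier of $1.65$ that appears in the rest of this answer. The variable names are illustrative.

```r
# Sketch: how much would the number of "yes" answerers vary between surveys?
# Assumes the count is Binomial(n, p.yes), with p.yes estimated from the data.
n <- 3031            # total respondents
m <- 616             # observed "yes" answerers
p.yes <- m / n       # estimated proportion of "yes" answerers

sd.m <- sqrt(n * p.yes * (1 - p.yes))   # binomial standard deviation of the count
z <- 1.65                               # approximate two-sided 90% multiplier
range.m <- round(m + c(-1, 1) * z * sd.m)
range.m                                 # 579 653
```

Under these assumptions the subsample size moves by only about 6% in either direction, which is why treating 616 as fixed changes the confidence limits very little.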

The binomial theory is applicable: we estimate the proportion of QualityB people among the "yes" answerers as $34/616 \approx 5.5$% and the variance of a single response as $(34/616)(1 - 34/616)$, for an estimated standard deviation of $0.22836$. Because the subsample size is $616$, the standard error of the proportion is $0.22836/\sqrt{616} \approx 0.92$%. So--just to see where this is headed--we could use a normal approximation as a rough check. This tells us to expect the confidence interval procedure to give us a range from about $5.5 - 1.65 \times 0.92 \approx 4.0$% to $5.5 + 1.65 \times 0.92 \approx 7.0$%. This range is very close to the quoted value of $5.5 \pm 1.54$%. (The multiplier of $1.65$ should give, approximately, a $90$% two-sided confidence interval.)
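The arithmetic in this paragraph can be checked directly. The following is a minimal sketch of the normal-approximation calculation, using only the counts from the question and the multiplier of $1.65$; the variable names are my own.

```r
# Sketch: normal-approximation check of the interval for QualityB.
x <- 34              # "yes" answerers with QualityB present
m <- 616             # all "yes" answerers (the subsample size)

p.hat <- x / m                        # estimated proportion, about 0.0552
sd.hat <- sqrt(p.hat * (1 - p.hat))   # estimated SD of one response, about 0.228
se <- sd.hat / sqrt(m)                # standard error of p.hat, about 0.0092
z <- 1.65                             # approximate two-sided 90% multiplier

ci <- p.hat + c(-1, 1) * z * se       # lower and upper limits as proportions
round(100 * ci, 1)                    # in percent: 4 7
```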

We conclude that the proportion of people with QualityB among all "yes" answerers in the population is likely to be between $4$% and $7$%. To deduce this, we have used a procedure that will mislead us (by the luck of the draw) at most $10$% of the time it is appropriately applied; that's where our "confidence" comes from.

Edit

Because some questions about the validity of this answer have been raised in comments, let's check. One way is to bootstrap the data to assess the bias. But before proceeding, let's recast the problem in a more concrete form.

Suppose, then, we are interested in the proportion of U.S. senior citizens (defined, say, as people who were age 55 or older on January 1, 2012 and were resident in the U.S. on that date) who have ever tried recreational drugs. To this end, we identify all resident adults and send out a questionnaire to 3031 randomly selected adults. On it are two questions, analogs of the Q1a and QualityB questions discussed earlier:

1. What was your age on January 1, 2012?
2. If you answered 55 or older to question 1, have you ever knowingly consumed a drug, for recreational purposes, that at the time either required a physician's prescription or was illegal to use or sell in the U.S.?

Miraculously--perhaps through incredibly diligent followup--you receive valid responses on all 3031 questionnaires. The data are:

616 of the responses are ages 55 or older.
Of those 616, 34 answered "yes" to the second question.

What proportion should you estimate? Is there any valid way to estimate a proportion at all?

One form of the bootstrap studies this problem by adopting a synthetic population having exactly the same proportions observed in the data and recreates the experiment and its analysis many, many times, independently. Here is reproducible R code to do that for 100,000 independent trials, using the Binomial estimate recommended above:

trial <- function(n.trials, n=1, p1=1/2, p2=1/2) {
  x <- rmultinom(n.trials, n, c(p2,1-p2) %o% c(p1,1-p1))
  m <- x[1,]+x[2,] # Total who answer the second question
  mean <- x[1,]/m  # Proportion of "yeses" in the second question
  se <- sqrt(mean*(1-mean)/m)
  rbind(mean, se)  # Estimate and standard error of the estimate for each trial
}
set.seed(17)
sim <- trial(100000, 3031, 616/3031, 34/616)

The average estimate, mean(sim[1,]), is $0.0551537$: almost identical to the correct value of $34/616 \approx 0.0551948$ in the synthetic population. There is no appreciable bias.

How about the approximate confidence interval procedure? We can check in each of the 100,000 trials whether the confidence interval covered the true value of $34/616$ or not:

coverage.upper <- sim[2,] * 1.65 + sim[1,] > 34/616
coverage.lower <- -sim[2,] * 1.65 + sim[1,] < 34/616
(sum(coverage.upper) + sum(coverage.lower))/100000 - 1

The result, $0.89542$, is within one-half of one percent of the desired coverage of $0.90$: that's excellent, especially given the approximations that were made.

Given these (hypothetical) data we may legitimately conclude, then, that approximately $5.52$% of all U.S. senior citizens have used recreational drugs. With $90$% confidence that proportion is between $4$% and $7$%.

Related Solutions

Confidence Interval – Determining Sample Sizes for Binomial Confidence Intervals

(1) Yes.

(2) Yes. There are only $n+1$ possible outcomes for a binomial random variable, so it is possible to look at what happens for each possible outcome - in fact this is faster than simulating lots and lots of outcomes!

Let $X$ be the number of "successes" among the $n$ customers and let $\hat{p}=X/n$. The confidence interval is $\hat{p}\pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$, so the halfwidth is $z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$. Thus we want to compute $P(z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\leq 0.005)$. In R, we can do this as follows:

target.halfWidth<-0.005

p<-0.016 #true proportion
n.vec<-seq(from=1000, to=3000, by=100) #number of samples

# Vector to store results
prob.hw<-rep(NA,length(n.vec))

# Loop through desired sample size options
for (i in 1:length(n.vec))
{
  n<-n.vec[i]

  # Look at all possible outcomes
  x<-0:n
  p.est<-x/n

  # Compute halfwidth for each option
  halfWidth<-qnorm(0.95)*sqrt(p.est*(1-p.est)/n)

  # What is the probability that the halfwidth is less than 0.005?
  prob.hw[i]<-sum({halfWidth<=target.halfWidth}*dbinom(x,n,p))
}

# Plot results
plot(n.vec,prob.hw,type="b")
abline(0.95,0,col=2)

# Get the minimal n required
n.vec[min(which(prob.hw>=0.95))]

The answer is $n=2200$ in this case as well.

Finally, it is usually a good idea to verify that the asymptotic normal approximation interval actually gives the desired coverage. In R, we can compute the coverage probability (i.e. the actual confidence level) as:

p<-0.016
n<-2200
x<-0:n
p.est<-x/n
halfWidth<-qnorm(0.95)*sqrt(p.est*(1-p.est)/n)
# Coverage probability
sum({abs(p-p.est)<=halfWidth}*dbinom(x,n,p))

Different $p$ give different coverages. For $p$ around $0.015$, the actual confidence level of the nominal $90\%$ interval seems to be about $89\%$ in general, which I presume is fine for your purposes.

(3) When you sample from a finite population, the number of successes is not binomial but hypergeometric. If the population is large compared to your sample size, the binomial works just fine as an approximation. If you sample 1000 out of 5000, say, it does not. Have a look at confidence intervals for proportions based on the hypergeometric distribution!
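To make the difference concrete, here is a small sketch comparing the standard deviation of the sample proportion under the two models when 1000 units are drawn from a population of 5000. The population success count `K = 80` is a hypothetical value chosen only for illustration (it corresponds to $p = 0.016$), and the closed-form hypergeometric formula is checked against the exact distribution from `dhyper`.

```r
# Sketch: binomial vs. hypergeometric spread when sampling 1000 out of 5000.
# K is a hypothetical number of "successes" in the population (p = 0.016).
N <- 5000; n <- 1000; K <- 80
p <- K / N

sd.binom <- sqrt(p * (1 - p) / n)                      # with replacement
sd.hyper <- sqrt(p * (1 - p) / n * (N - n) / (N - 1))  # without replacement

# Check the closed-form hypergeometric value against the exact distribution
x <- 0:n
pr <- dhyper(x, K, N - K, n)               # P(X = x) successes in the sample
sd.exact <- sqrt(sum(pr * (x / n - p)^2))

c(binomial = sd.binom, hypergeometric = sd.hyper, exact = sd.exact)
```

Under these assumptions the hypergeometric standard deviation is smaller by the factor $\sqrt{(N-n)/(N-1)} \approx 0.89$, so a binomial interval would be noticeably too wide.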

Answers to additional questions:

Let $(p_L,p_U)$ be the confidence interval.

1) In that case you are no longer computing $P(p_U-p_L\leq0.01)$ but $$P\Big(p_U-p_L\leq0.01~\mbox{and}~p\in(p_L,p_U)\Big),$$ i.e. the probability that the interval both has length at most 0.01 and actually contains $p$. This may be an interesting quantity, depending on what you're interested in...

2) Maybe, but probably not. If the population size is large compared to the sample size you don't need it, and if it's not then the binomial distribution is not appropriate to begin with!

3) Sprop seems to contain confidence intervals based on the hypergeometric distribution, so that should work just fine.

Solved – Confidence Interval of Categorical Data with Multiple responses

If there is just one category of interest, e.g. lemons, there's no problem with "binarizing" it and extracting a CI for the population proportion who like lemons. However, if you do this for each category, you might end up with something misleading because the numbers of responses for each category are dependent and you are making multiple probability statements, and so the probability that each proportion lies in your interval could be very different than you might expect. You could report a "confidence region," or "simultaneous confidence intervals" that capture this dependency better. See Sison and Glaz.
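As a rough illustration of why the multiplicity matters, the sketch below widens each per-category interval with a Bonferroni adjustment. This is a simpler stand-in, not the Sison and Glaz procedure, and it remains valid when the category counts are dependent. The sample size and counts are hypothetical, made up only so that the code runs.

```r
# Sketch: Bonferroni-adjusted intervals as a simple stand-in for simultaneous
# confidence intervals (this is NOT the Sison-Glaz procedure).
# Hypothetical data: 200 respondents, each of whom could tick several fruits.
n <- 200
likes <- c(lemons = 62, apples = 131, pears = 88)
k <- length(likes)
conf <- 0.95

p.hat <- likes / n
se <- sqrt(p.hat * (1 - p.hat) / n)
z.single <- qnorm(1 - (1 - conf) / 2)        # multiplier for one interval alone
z.bonf <- qnorm(1 - (1 - conf) / (2 * k))    # Bonferroni multiplier for k intervals

cbind(estimate = p.hat,
      lower = p.hat - z.bonf * se,
      upper = p.hat + z.bonf * se)
```

The adjusted multiplier is larger than the single-interval one, so each interval is wider; in exchange, the probability that all `k` intervals cover their targets at once is approximately at least 95%, subject to the accuracy of the normal approximation.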
