Solved – Sample size for binomial confidence interval

binomial distribution, confidence interval, sample-size, sampling

Background:

A group at work is sampling 1,000 customers to contact and will use the results to determine whether the effort is worthwhile. I wanted to see whether this (pretty much) arbitrary sample size was at all "good enough".

If we assume that the true proportion of successes in the population is 0.016 (1.6%), how large a sample would I need to take in order to get a confidence interval margin of error ("half width") of, say, 0.005 (0.5%)? Here is my approach in R:

install.packages("Hmisc")
library(Hmisc)

target.halfWidth <- 0.005
sims <- 25000 #number of draws from binomial to perform    

p <- 0.016 #true proportion
n <- seq(from=500, to=5000, by=100) #number of samples

#hold results
results <- matrix(numeric(0), length(n),2)

#loop through desired sample size options
for (i in 1:length(n))
{
  x <- rbinom(sims, n[i], p) #draws from binomial with p and n
  ci <- binconf(x, n[i], method="asymptotic", alpha=0.1) #normal theory 90% CI
  #binconf returns columns PointEst, Lower, Upper, so this is upper limit minus point estimate
  half_width <- ci[, 3] - ci[, 1] #half width of CI

  #need the number where the half width is within the target range
  prob.halfWidth <- length(half_width[half_width<target.halfWidth])/sims

  #store results
  results[i, 1] <- n[i]
  results[i, 2] <- prob.halfWidth
}

#plot
plot(results[, 2], results[, 1], type="b")
results 

This simulation showed that we would need a sample of 2,200 to be 95% confident that the half width of a 90% CI would be at most 0.005.

Questions:

  1. Is this a proper method?

  2. Are there better ways?

  3. What advice can you give if there are finite sub-populations to sample from? Say that we want to know how many samples to take from populations where there are not "a lot" of customers to choose from. Maybe there are only 5,000 in a certain group; should we not be able to sample fewer of them to make a determination, compared to a group where there are 50,000 to choose from?

Added after MansT's answer:

  1. Would it make sense, under my scenario with the simulation draws to add a step where this line:

    prob.halfWidth <- length(half_width[half_width<target.halfWidth])/sims

    only increments the numerator when the resulting CI also contains the true p (i.e. 0.016)?

  2. Under your code, would it also be appropriate, when dealing with a finite population, to add the finite population correction (FPC) to your line:

    halfWidth <- qnorm(0.95)*sqrt(p.est*(1-p.est)/n)

  3. I am not sure of the formula for a CI based on the hypergeometric distribution, but perhaps I could replace my code line

    ci <- binconf(x,n[i],method="asymptotic",alpha=0.1) #normal theory 90% CI

with the Sprop function in R?

Best Answer

(1) Yes.

(2) Yes. There are only $n+1$ possible outcomes for a binomial random variable, so it is possible to look at what happens for each possible outcome - in fact this is faster than simulating lots and lots of outcomes!

Let $X$ be the number of "successes" among the $n$ customers and let $\hat{p}=X/n$. The confidence interval is $\hat{p}\pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$, so the halfwidth is $z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$. Thus we want to compute $P(z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\leq 0.005)$. In R, we can do this as follows:

target.halfWidth<-0.005

p<-0.016 #true proportion
n.vec<-seq(from=1000, to=3000, by=100) #number of samples

# Vector to store results
prob.hw<-rep(NA,length(n.vec))

# Loop through desired sample size options
for (i in 1:length(n.vec))
{
  n<-n.vec[i]

  # Look at all possible outcomes
  x<-0:n
  p.est<-x/n

  # Compute halfwidth for each option
  halfWidth<-qnorm(0.95)*sqrt(p.est*(1-p.est)/n)

  # What is the probability that the halfwidth is at most 0.005?
  prob.hw[i]<-sum({halfWidth<=target.halfWidth}*dbinom(x,n,p))
}

# Plot results
plot(n.vec,prob.hw,type="b")
abline(0.95,0,col=2)

# Get the minimal n required
n.vec[min(which(prob.hw>=0.95))]

The answer is $n=2200$ in this case as well.
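As a rough cross-check (not part of the computation above), the textbook planning formula $n = z_{\alpha/2}^2\,p(1-p)/h^2$ gives the sample size at which the halfwidth equals the target $h$ when $\hat{p}$ happens to equal $p$ exactly. It ignores the variability of $\hat{p}$, so it carries no 95% assurance and should come out smaller than 2200. A minimal sketch, where the names `z`, `h` and `n.closed` are only illustrative:

```r
# Closed-form planning formula: halfwidth equals h when p.est == p exactly
z <- qnorm(0.95)   # two-sided 90% interval
p <- 0.016         # assumed true proportion
h <- 0.005         # target halfwidth

n.closed <- ceiling(z^2 * p * (1 - p) / h^2)
n.closed
```

This comes out at roughly 1,700. The gap up to 2200 is the price of requiring the halfwidth to meet the target with probability 0.95 rather than only when $\hat{p}=p$.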

Finally, it is usually a good idea to verify that the asymptotic normal approximation interval actually gives the desired coverage. In R, we can compute the coverage probability (i.e. the actual confidence level) as:

p<-0.016
n<-2200
x<-0:n
p.est<-x/n
halfWidth<-qnorm(0.95)*sqrt(p.est*(1-p.est)/n)
# Coverage probability
sum({abs(p-p.est)<=halfWidth}*dbinom(x,n,p))

Different $p$ give different coverages. For $p$ around $0.015$, the actual confidence level of the nominal $90\%$ interval seems to be about $89\%$ in general, which I presume is fine for your purposes.
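To see how the coverage moves with $p$, the same exact calculation can be wrapped in a function and evaluated over a grid. This is only a sketch: the helper name `coverage`, the grid `p.grid` and the choice of $n=2200$ are assumptions made for illustration.

```r
# Exact coverage of the nominal 90% normal-approximation interval for a given p and n
coverage <- function(p, n, level = 0.90) {
  x <- 0:n                                   # all possible outcomes
  p.est <- x / n
  halfWidth <- qnorm(1 - (1 - level) / 2) * sqrt(p.est * (1 - p.est) / n)
  sum((abs(p - p.est) <= halfWidth) * dbinom(x, n, p))
}

p.grid <- seq(0.010, 0.020, by = 0.001)      # illustrative grid around 0.016
cov.grid <- sapply(p.grid, coverage, n = 2200)

plot(p.grid, cov.grid, type = "b", xlab = "true p", ylab = "coverage")
abline(h = 0.90, col = 2)                    # nominal level for reference
```

Because the binomial is discrete, the curve is typically jagged rather than smooth, so it is worth looking at a range of plausible $p$ rather than a single value.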

(3) When you sample from a finite population, the number of successes is not binomial but hypergeometric. If the population is large compared to your sample size, the binomial works just fine as an approximation. If you sample 1000 out of 5000, say, it does not. Have a look at confidence intervals for proportions based on the hypergeometric distribution!
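The exact calculation from (2) can be adapted to a finite population by swapping `dbinom` for `dhyper`. The sketch below is one possible way to do it, not a definitive method: it assumes a population of $N=5000$ containing `round(p*N)` successes, and it uses a normal-approximation halfwidth multiplied by the usual finite population correction $(N-n)/(N-1)$, which is my assumption about the interval you would use. The function name `prob_hw_finite` is hypothetical.

```r
# Probability that the FPC-adjusted halfwidth is within target, sampling n out of N
prob_hw_finite <- function(n, N, p, target = 0.005, level = 0.90) {
  K <- round(p * N)                 # assumed number of successes in the population
  x <- 0:n                          # dhyper returns 0 for impossible outcomes
  p.est <- x / n
  fpc <- (N - n) / (N - 1)          # finite population correction (variance scale)
  halfWidth <- qnorm(1 - (1 - level) / 2) * sqrt(p.est * (1 - p.est) / n * fpc)
  sum((halfWidth <= target) * dhyper(x, K, N - K, n))
}

n.vec <- seq(from = 500, to = 3000, by = 100)
prob.finite <- sapply(n.vec, prob_hw_finite, N = 5000, p = 0.016)

plot(n.vec, prob.finite, type = "b")
abline(h = 0.95, col = 2)
```

Comparing this curve with the binomial one from (2) shows how much smaller a sample a small sub-population lets you get away with.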

Answers to additional questions:

Let $(p_L,p_U)$ be the confidence interval.

1) In that case you are no longer computing $P(p_U-p_L\leq0.01)$ but $$P\Big(p_U-p_L\leq0.01~\mbox{and}~p\in(p_L,p_U)\Big),$$ i.e. the probability that the interval both has length at most 0.01 and actually contains $p$. This may be an interesting quantity, depending on what you're interested in...

2) Maybe, but probably not. If the population size is large compared to the sample size you don't need it, and if it's not then the binomial distribution is not appropriate to begin with!

3) Sprop seems to provide confidence intervals based on the hypergeometric distribution, so that should work just fine.
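For the joint probability discussed in 1), the exact approach carries over directly: flag the outcomes whose interval is short enough, flag those whose interval contains $p$, and sum the binomial probabilities where both hold. A minimal sketch for $n=2200$, with variable names (`short`, `covers`, `joint`) chosen only for illustration:

```r
p <- 0.016
n <- 2200
target.halfWidth <- 0.005

x <- 0:n
p.est <- x / n
halfWidth <- qnorm(0.95) * sqrt(p.est * (1 - p.est) / n)

short  <- halfWidth <= target.halfWidth      # interval is narrow enough
covers <- abs(p - p.est) <= halfWidth        # interval contains the true p

prob.short <- sum(short * dbinom(x, n, p))             # halfwidth criterion only
coverage   <- sum(covers * dbinom(x, n, p))            # coverage only
joint      <- sum((short & covers) * dbinom(x, n, p))  # both at once

c(prob.short = prob.short, coverage = coverage, joint = joint)
```

The joint probability can never exceed either of the two marginal ones, so requiring both conditions will generally push the required $n$ up.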
