Sample Size Calculation – Determining Sample Size for Given Confidence Interval and Margin of Error

sample, sample-size

I would like to calculate my ideal sample using the Qualtrics calculator:

https://www.qualtrics.com/de/erlebnismanagement/marktforschung/stichprobenrechner/

For this, I need a confidence level and a margin of error. Do the confidence level and the margin of error together have to add up to 100%?

Best Answer

In order to find the required sample size $n,$ you need a confidence level (such as $.95 = 95\%)$ and a margin of error (such as $\pm .03 = \pm 3\%).$

So that an explicit answer to your question doesn't get lost in a longer explanation of confidence intervals: No. There's no restriction that the confidence level and the margin of error must add to $1.$

The calculator in the link also asks for a population size, but that is not important unless you're thinking you might sample more than 10% of the population. So if this is for a nationwide poll in a large country with millions of eligible subjects, you can ignore that part. (If you're using the calculator in the link, you'd enter something like $10\,000\,000).$
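To see roughly why the population size rarely matters, the sketch below applies the usual finite population correction to the sample size obtained from the normal approximation. The function name `n_adjusted` and the population sizes are illustrative choices, and whether the linked calculator uses exactly this correction is an assumption.

```r
# Sketch: finite population correction for a required sample size.
# Assumes the normal-approximation formula n0 = z^2 * p * (1 - p) / E^2
# and the standard correction n = n0 / (1 + (n0 - 1) / N).
n_adjusted <- function(N, E = 0.03, conf = 0.95, p = 0.5) {
  z  <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value
  n0 <- z^2 * p * (1 - p) / E^2     # sample size ignoring population size
  ceiling(n0 / (1 + (n0 - 1) / N))  # corrected value, rounded up
}

pop_sizes <- c(5000, 1e5, 1e7)      # hypothetical population sizes
n_needed  <- sapply(pop_sizes, n_adjusted)
print(data.frame(N = pop_sizes, n = n_needed))
```

With a population in the millions the corrected value is essentially the uncorrected one, whereas a population of a few thousand lowers it noticeably.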

The margin of error for a 95% confidence interval from a poll is $\pm 1.96\sqrt{\frac{p(1-p)}{n}},$ where $n$ is the sample size and $p$ is the true population proportion with the relevant attribute (such as favoring Proposition A in an upcoming election).

The margin of error $E$ is the half-width of your confidence interval, expressed as a proportion (a percentage in your link). Maybe you'd like to say that the true proportion is $0.55 \pm 0.03$ or $55\% \pm 3\%.$ Then $E = .03 = 3\%.$

Not knowing $p,$ you could either guess what $p$ might be, or take the worst case, which is $p = 1/2$ (giving the largest possible margin of error). A 95% confidence interval (CI) has the form $\hat p \pm E,$ with $E=1.96\sqrt{\frac{p(1-p)}{n}}.$ If you're taking $p = 1/2,$ then you have $E = 1.96\sqrt{.25/n} = 0.98/\sqrt{n} \approx 1/\sqrt{n}.$ So, if $E = 3\%,$ then $n \approx 1/(.03^2) \approx 1111$ subjects. (Keeping the factor $0.98$ instead of rounding it to $1$ gives $n \approx 1068$ after rounding up.)
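The calculation above can be sketched in R as follows. The function name `sample_size` and the guess $p = 0.3$ are hypothetical, chosen only for illustration; the formula is the normal-approximation one from the text, solved for $n.$

```r
# Sketch: sample size for a given margin of error E and confidence level.
# Solves E = z * sqrt(p * (1 - p) / n) for n and rounds up.
sample_size <- function(E, conf = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf) / 2)   # about 1.96 for conf = 0.95
  ceiling(z^2 * p * (1 - p) / E^2)
}

n_worst <- sample_size(E = 0.03)           # worst case p = 1/2
n_guess <- sample_size(E = 0.03, p = 0.3)  # hypothetical prior guess for p
n_rough <- 1 / 0.03^2                      # the 1/E^2 shortcut
print(c(worst = n_worst, guess = n_guess, rough = n_rough))
```

The shortcut $1/E^2$ slightly overstates the requirement because $1.96 \cdot 0.5 = 0.98 < 1,$ and any guess for $p$ away from $1/2$ lowers the required $n.$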

Note: Here's why I say that $p = 1/2$ is the 'worst case', leading to the largest margin of error. The factor $Q = p(1 - p)$ in the margin of error reaches its maximum when $p = 1/2.$ So the margin of error $E$ is maximized when $p = 1/2$ and for a fixed value of $E$ that leads to the largest required $n.$

p = seq(0, 1, by=.001);  Q = p*(1-p)   # grid of p values and Q = p(1-p)
plot(p, Q, type="l", lwd=2)            # Q peaks at p = 1/2
abline(v = 1/2, col="green2")

Related Solutions

Confidence Interval – Determining Sample Sizes for Binomial Confidence Intervals

(1) Yes.

(2) Yes. There are only $n+1$ possible outcomes for a binomial random variable, so it is possible to look at what happens for each possible outcome - in fact this is faster than simulating lots and lots of outcomes!

Let $X$ be the number of "successes" among the $n$ customers and let $\hat{p}=X/n$. The confidence interval is $\hat{p}\pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$, so the halfwidth is $z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$. Thus we want to compute $P(z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\leq 0.005)$. In R, we can do this as follows:

target.halfWidth<-0.005

p<-0.016 #true proportion
n.vec<-seq(from=1000, to=3000, by=100) #number of samples

# Vector to store results
prob.hw<-rep(NA,length(n.vec))

# Loop through desired sample size options
for (i in seq_along(n.vec))
{
  n<-n.vec[i]

  # Look at all possible outcomes
  x<-0:n
  p.est<-x/n

  # Compute halfwidth for each option
  halfWidth<-qnorm(0.95)*sqrt(p.est*(1-p.est)/n)

  # What is the probability that the halfwidth is less than 0.005?
  prob.hw[i]<-sum({halfWidth<=target.halfWidth}*dbinom(x,n,p))
}

# Plot results
plot(n.vec,prob.hw,type="b")
abline(0.95,0,col=2)

# Get the minimal n required
n.vec[min(which(prob.hw>=0.95))]

The answer is $n=2200$ in this case as well.

Finally, it is usually a good idea to verify that the asymptotic normal approximation interval actually gives the desired coverage. In R, we can compute the coverage probability (i.e. the actual confidence level) as:

p<-0.016
n<-2200
x<-0:n
p.est<-x/n
halfWidth<-qnorm(0.95)*sqrt(p.est*(1-p.est)/n)
# Coverage probability
sum({abs(p-p.est)<=halfWidth}*dbinom(x,n,p))

Different $p$ give different coverages. For $p$ around $0.015$, the actual confidence level of the nominal $90\%$ interval seems to be about $89\%$ in general, which I presume is fine for your purposes.
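To see how the coverage varies with $p,$ the same calculation can be wrapped in a function and evaluated over a grid. This is only a sketch: the function name `coverage` and the grid of $p$ values are assumptions made for illustration.

```r
# Sketch: actual coverage of the nominal 90% normal-approximation interval
# for several true proportions p (the grid is a hypothetical choice).
coverage <- function(p, n, conf = 0.90) {
  x <- 0:n                                   # all possible outcomes
  p.est <- x / n
  z <- qnorm(1 - (1 - conf) / 2)             # equals qnorm(0.95) for 90%
  halfWidth <- z * sqrt(p.est * (1 - p.est) / n)
  sum((abs(p - p.est) <= halfWidth) * dbinom(x, n, p))
}

p.grid <- seq(0.010, 0.020, by = 0.001)
cov.grid <- sapply(p.grid, coverage, n = 2200)
print(round(cbind(p = p.grid, coverage = cov.grid), 3))
```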

(3) When you sample from a finite population, the number of successes is not binomial but hypergeometric. If the population is large compared to your sample size, the binomial works just fine as an approximation. If you sample 1000 out of 5000, say, it does not. Have a look at confidence intervals for proportions based on the hypergeometric distribution!
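A rough way to see the difference is to compare the spread of the number of successes under both models. In the sketch below the population of 5000 with 80 successes is a made-up example (it matches $p = 0.016$), and the helper `sd.of` is a hypothetical name.

```r
# Sketch: binomial vs hypergeometric when sampling 1000 out of 5000.
# Hypothetical setup: 80 of the 5000 units are "successes" (p = 0.016).
N <- 5000; n <- 1000; M <- 80
p <- M / N
x <- 0:n                               # possible numbers of successes

pr.hyper <- dhyper(x, M, N - M, n)     # sampling without replacement
pr.binom <- dbinom(x, n, p)            # sampling with replacement

# Standard deviation of a discrete distribution given values and probabilities
sd.of <- function(x, pr) sqrt(sum(x^2 * pr) - sum(x * pr)^2)
sd.hyper <- sd.of(x, pr.hyper)
sd.binom <- sd.of(x, pr.binom)
print(c(hyper = sd.hyper, binom = sd.binom, ratio = sd.hyper / sd.binom))
```

The ratio equals $\sqrt{(N-n)/(N-1)},$ so binomial-based intervals are wider than necessary when the sample is a sizeable fraction of the population.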

Answers to additional questions:

Let $(p_L,p_U)$ be the confidence interval.

1) In that case you are no longer computing $P(p_U-p_L\leq0.01)$ but $$P\Big(p_U-p_L\leq0.01~\mbox{and}~p\in(p_L,p_U)\Big),$$ i.e. the probability that the interval both contains $p$ and has length at most 0.01. This may be an interesting quantity, depending on what you're interested in...

2) Maybe, but probably not. If the population size is large compared to the sample size you don't need it, and if it's not then the binomial distribution is not appropriate to begin with!

3) Sprop seems to contain confidence intervals based on the hypergeometric distribution, so that should work just fine.
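The joint probability in 1) can be computed by enumerating all outcomes, in the same way as before. The sketch below reuses $p = 0.016$ and $n = 2200$ from the coverage calculation; the variable names `short` and `covers` are illustrative.

```r
# Sketch: P(interval length <= 0.01 and p inside the interval)
# for p = 0.016, n = 2200 and the nominal 90% interval.
p <- 0.016
n <- 2200
x <- 0:n
p.est <- x / n
halfWidth <- qnorm(0.95) * sqrt(p.est * (1 - p.est) / n)

short  <- 2 * halfWidth <= 0.01          # length p_U - p_L at most 0.01
covers <- abs(p - p.est) <= halfWidth    # interval contains the true p

prob.short  <- sum(short * dbinom(x, n, p))
prob.covers <- sum(covers * dbinom(x, n, p))
prob.both   <- sum((short & covers) * dbinom(x, n, p))
print(c(short = prob.short, covers = prob.covers, both = prob.both))
```

The joint probability can never exceed either of the two separate probabilities, so requiring both conditions generally calls for a somewhat larger $n.$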

Solved – Confidence intervals vs sample size

In addition to Peter's great answer, here are some answers to your specific questions:

Who to trust will depend also on who is doing the poll and what effort they put into getting a good quality poll. A bigger sample size is not better if the sample is not representative: taking a huge poll, but only in one non-swing state, would not give very good results.

There is a relationship between sample size and the width of the confidence interval, but other things also influence the width, such as how close the proportion is to 0, 1, or 0.5; what bias adjustments were used; and how the sample was taken (clustering, stratification, etc.). The general rule is that the width of the confidence interval will be proportional to $\frac{1}{\sqrt{n}}$, so to halve the interval you need 4 times the sample size.
If you know enough about how the sample was collected and what formula was used to compute the interval then you could solve for the standard deviation (you also need to know the confidence level being used, usually 95%, i.e. $\alpha = 0.05$). But the formula is different for stratified vs. cluster samples. Also most polls look at percentages, so would use the binomial distribution.
There are ways to combine the information, but you would generally need to know something about how the samples were collected, or be willing to make some form of assumptions about how the intervals were constructed. A Bayesian approach is one way.
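The $\frac{1}{\sqrt{n}}$ rule mentioned above can be checked with a few lines of R. This is a sketch that assumes a simple random sample and the normal-approximation interval for a proportion; the function name `margin` and the poll sizes are hypothetical.

```r
# Sketch: the margin of error shrinks like 1/sqrt(n).
# Assumes a simple random sample and the normal-approximation interval;
# stratified or cluster designs need different formulas.
margin <- function(n, p = 0.5, conf = 0.95) {
  qnorm(1 - (1 - conf) / 2) * sqrt(p * (1 - p) / n)
}

n <- c(500, 1000, 2000, 4000)   # hypothetical poll sizes
moe <- margin(n)
print(round(data.frame(n = n, margin = moe), 4))

# Quadrupling n halves the margin of error
ratio <- margin(1000) / margin(4000)
print(ratio)
```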
