(1) Yes.
(2) Yes. There are only $n+1$ possible outcomes for a binomial random variable, so it is possible to look at what happens for each possible outcome - in fact this is faster than simulating lots and lots of outcomes!
Let $X$ be the number of "successes" among the $n$ customers and let $\hat{p}=X/n$. The confidence interval is $\hat{p}\pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$, so the halfwidth is $z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$. Thus we want to compute $P(z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\leq 0.005)$. In R, we can do this as follows:
target.halfWidth <- 0.005
p <- 0.016                                       # true proportion
n.vec <- seq(from = 1000, to = 3000, by = 100)   # candidate sample sizes
# Vector to store results
prob.hw <- rep(NA, length(n.vec))
# Loop through the candidate sample sizes
for (i in seq_along(n.vec)) {
  n <- n.vec[i]
  # Look at all possible outcomes
  x <- 0:n
  p.est <- x/n
  # Compute the halfwidth for each outcome
  halfWidth <- qnorm(0.95)*sqrt(p.est*(1 - p.est)/n)
  # What is the probability that the halfwidth is at most 0.005?
  prob.hw[i] <- sum((halfWidth <= target.halfWidth)*dbinom(x, n, p))
}
# Plot results
plot(n.vec, prob.hw, type = "b")
abline(h = 0.95, col = 2)
# Get the minimal n required
n.vec[min(which(prob.hw >= 0.95))]
The answer is $n=2200$ in this case as well.
Finally, it is usually a good idea to verify that the asymptotic normal approximation interval actually gives the desired coverage. In R, we can compute the coverage probability (i.e. the actual confidence level) as:
p <- 0.016
n <- 2200
x <- 0:n
p.est <- x/n
halfWidth <- qnorm(0.95)*sqrt(p.est*(1 - p.est)/n)
# Coverage probability
sum((abs(p - p.est) <= halfWidth)*dbinom(x, n, p))
Different $p$ give different coverages. For $p$ around $0.015$, the actual confidence level of the nominal $90\%$ interval seems to be about $89\%$ in general, which I presume is fine for your purposes.
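To see how the coverage varies, here is a sketch that recomputes the actual coverage at $n=2200$ over a small grid of true proportions near $0.015$ (the grid endpoints are my choice, not from the question):

```r
# Sketch: actual coverage of the nominal 90% interval at n = 2200
# over an assumed grid of true proportions near 0.015
n <- 2200
x <- 0:n
p.est <- x/n
halfWidth <- qnorm(0.95)*sqrt(p.est*(1 - p.est)/n)
p.grid <- seq(0.010, 0.020, by = 0.001)
coverage <- sapply(p.grid, function(p)
  sum((abs(p - p.est) <= halfWidth)*dbinom(x, n, p)))
round(cbind(p = p.grid, coverage), 3)
```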
(3) When you sample from a finite population, the number of successes is not binomial but hypergeometric. If the population is large compared to your sample size, the binomial works just fine as an approximation. If you sample 1000 out of 5000, say, it does not. Have a look at confidence intervals for proportions based on the hypergeometric distribution!
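To illustrate the difference (with assumed numbers: a population of $N=5000$ containing $80$ successes, i.e. $p=0.016$, and a sample of $n=1000$), here is a sketch comparing the hypergeometric and binomial distributions of the success count:

```r
# Sketch with assumed numbers: N = 5000, K = 80 successes (p = 0.016), n = 1000
N <- 5000
K <- 80             # successes in the population
n <- 1000
x <- 0:min(n, K)    # possible success counts in the sample
p.hyper <- dhyper(x, K, N - K, n)   # sampling without replacement
p.binom <- dbinom(x, n, K/N)        # binomial approximation
# The hypergeometric is noticeably more concentrated: its standard deviation
# is smaller by the finite population correction sqrt((N - n)/(N - 1)),
# here about 0.89
sd.hyper <- sqrt(n*(K/N)*(1 - K/N)*(N - n)/(N - 1))
sd.binom <- sqrt(n*(K/N)*(1 - K/N))
c(sd.hyper, sd.binom, sd.hyper/sd.binom)
```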
Answers to additional questions:
Let $(p_L,p_U)$ be the confidence interval.
1) In that case you are no longer computing $P(p_U-p_L\leq0.01)$ but $$P\Big(p_U-p_L\leq0.01~\mbox{and}~p\in(p_L,p_U)\Big),$$ i.e. the probability that an interval both has length at most 0.01 and actually contains $p$. This may be an interesting quantity, depending on what you're interested in...
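A sketch of that joint computation, reusing the setup from above ($n=2200$, true $p=0.016$):

```r
# Sketch: probability that the interval has length at most 0.01
# AND covers the true p, reusing n = 2200 and p = 0.016 from above
p <- 0.016
n <- 2200
x <- 0:n
p.est <- x/n
halfWidth <- qnorm(0.95)*sqrt(p.est*(1 - p.est)/n)
short  <- 2*halfWidth <= 0.01            # length condition
covers <- abs(p - p.est) <= halfWidth    # coverage condition
sum((short & covers)*dbinom(x, n, p))    # joint probability
```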
2) Maybe, but probably not. If the population size is large compared to the sample size you don't need it, and if it's not then the binomial distribution is not appropriate to begin with!
3) Sprop seems to contain confidence intervals based on the hypergeometric distribution, so that should work just fine.
First of all, I agree with the comments left by heropup. I'll add some details.
The reason why your simulation breaks down may be a little subtle; at least I spent some time reading your code to find the source of the problem. Notice that you only simulate once for each of the cases, and your CI functions then resample this initial data set. This clearly induces a lot of dependence between the samples. For instance, if you draw a sample of 1000 from the original data set of 1000, there is only one way to do this; if you draw a sample of 999, the overwhelming majority of the data set will still be the same between resamples. You need to do independent sampling; otherwise, the 100 samples are essentially the same when you let $n$ get large.
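A minimal sketch of the fix (with made-up normal data and a t-interval; mu, sigma, n, and the replication count are all my choices, not from the question): draw a fresh sample on every replication rather than resampling one fixed data set.

```r
# Sketch: draw a fresh, independent sample in every replication
# (mu, sigma, n and reps are assumed values, not from the question)
set.seed(1)
mu <- 0; sigma <- 1; n <- 50; reps <- 100
covered <- replicate(reps, {
  x <- rnorm(n, mu, sigma)   # a new, independent sample each time
  ci <- mean(x) + c(-1, 1)*qt(0.975, n - 1)*sd(x)/sqrt(n)
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)   # empirical coverage, should be near 0.95
```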
Turning to your question: confidence intervals like the ones you compute above are based on a distributional assumption, for instance that your observations are normally distributed. If that assumption holds, the confidence interval will be 'centered' in the sense you describe, since you construct it symmetrically; this follows from the symmetry of the distribution and of the construction procedure.
Above, you also calculate confidence intervals when the distributional assumption you make is not correct. In that case a confidence interval need not be centered, even if the assumed distribution is symmetric. This can be seen by simulating observations from a chi-squared distribution and calculating confidence intervals based on a normal distribution.
However, by the central limit theorem, the mean of the chi-squared observations will be approximately normally distributed for large enough sample sizes.
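This can be sketched as follows (the degrees of freedom, sample size, and replication count are my choices): with chi-squared data, the two ways a normal-theory interval can miss the true mean occur at unequal rates, reflecting the skewness.

```r
# Sketch: normal-theory t-intervals applied to skewed chi-squared data
# (df, n and reps are assumed values)
set.seed(1)
df <- 2; n <- 10; reps <- 10000
miss <- replicate(reps, {
  x <- rchisq(n, df)         # true mean is df
  ci <- mean(x) + c(-1, 1)*qt(0.975, n - 1)*sd(x)/sqrt(n)
  c(low = df < ci[1], high = df > ci[2])
})
rowMeans(miss)   # the two miss rates are clearly unequal
```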
Finally, I just want to note that a confidence interval (or more generally, a confidence set) is, loosely speaking, just some subset of the parameter set such that, when you calculate this set a lot of times (hypothetically), it will contain the true parameter value, for example, 95% of the time. There is no claim that this set is 'centered' or symmetric around a parameter estimate; it can be chosen to have all sorts of strange forms. This is just not very intuitive and most of the time not very helpful.
qnorm is the quantile function for the normal distribution; more details are available by typing ?qnorm. You pick 0.975 to get a two-sided confidence interval: this puts 2.5% of the probability in the upper tail and 2.5% in the lower tail, as in the picture.
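For example:

```r
qnorm(0.975)          # 1.959964..., the familiar two-sided 95% critical value
qnorm(0.95)           # 1.644854..., used above for a 90% interval
pnorm(qnorm(0.975))   # 0.975: pnorm is the inverse of qnorm
```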