How do I calculate the sample mean given these variables (confidence interval, confidence level, and sample size)? I have everything except (obviously) the sample mean, and I also seem to lack the standard deviation.
[Math] Calculating sample mean from confidence interval, confidence level and sample size
statistics
Related Solutions
It's not surprising you're a bit confused; understanding what's really going on with confidence intervals can be tricky.
The short version: If you don't want to check all the files, you have to choose two different percentages: the confidence level (95% in your example) and how far off you're willing to be at that level (20% in your example). These percentages refer to two different quantities, so it doesn't make sense to add or subtract them from each other. Once you've made these choices, I think it is fine to use the online calculator to get a sample size.
If you want more detail on what's going on, here's the explanation: You're trying to estimate the true percentage of files that have correct data. Let's call that percentage $p$. Since you don't want to calculate $p$ exactly, you have to choose how far off you are willing to be with your estimate, say 20%. Unfortunately, you can't even be certain that your estimate of $p$ will be within 20%, so you have to choose a level of confidence that that estimate will be within 20% of $p$. You have chosen 95%. Then the online calculator gives you the sample size of 23 you need to estimate $p$ to within 20% at 95% confidence.
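If you're curious what the calculator is doing under the hood, here's a minimal sketch of the standard worst-case formula for a proportion, $n = z^2\,p(1-p)/E^2$ with $p = 0.5$ (an assumption on my part about which formula your calculator uses; exact results vary by a unit or two depending on rounding conventions):

```python
import math

def sample_size_for_proportion(margin, z=1.96, p=0.5):
    """Worst-case sample size to estimate a proportion to within `margin`
    at the confidence level implied by z (1.96 ~ 95%). Taking p = 0.5
    maximizes p*(1 - p), which gives the most conservative n."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# Estimate p to within 20% at 95% confidence:
print(sample_size_for_proportion(0.20))  # 25 with this formula; calculators
                                         # with different rounding or variance
                                         # assumptions may report, e.g., 23
```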
But what does that 95% really mean? Basically, it means that if you were to choose lots and lots of samples of size 23 and calculate a confidence interval from each one, 95% of the resulting confidence intervals would actually contain the unknown value of $p$. The other 5% would give an interval of some kind that does not include $p$. (Some would be too large, others would be too small.) Another way to look at it is that choosing a 95% confidence interval means that you're choosing a method that gives correct results (i.e., produces a confidence interval that actually contains the value of $p$) 95% of the time.
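You can watch this interpretation play out in a quick simulation (a sketch, assuming a plain normal-approximation interval): draw many samples of size 23, build an interval from each, and count how often the true $p$ is captured.

```python
import random

def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    half = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - half, p_hat + half

p_true, n, trials = 0.8, 23, 10_000
hits = 0
for _ in range(trials):
    successes = sum(random.random() < p_true for _ in range(n))
    lo, hi = wald_interval(successes, n)
    hits += lo <= p_true <= hi
print(f"coverage: {hits / trials:.3f}")  # close to 0.95, though the Wald
                                         # interval is known to drift from
                                         # nominal coverage at small n
```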
To answer your specific questions:
"Does that mean that 'I can be 95% confident that 80% to 100% of the files are correct'?" Not precisely. It really means that you can be 95% confident that the true percentage of correct files is between 80% and 100%. That's a subtle distinction.
"And only then I can say with 95% confidence that the files are correct? (99% +- 4% = 95% to 100%)?" No, this is confusing the two kinds of percentages. The 99% refers to 99% of all confidence intervals constructed if you were to construct lots of them. The 4% refers to an error margin of $\pm$ 4% for the files.
One other thing to remember is that the sample size estimator assumes that the population you're drawing from is much, much larger than the size of the sample you end up going with. Since your population is fairly small, you can get away with a smaller sample at the same level of confidence. The determination of exactly how small, though, is a much more difficult calculation. It's beyond what you would have seen in a basic statistics class. I'm not sure how to do it; maybe someone else on the site does. (EDIT: Even better: take Jyotirmoy Bhattacharya's suggestion and ask on Stats Stack Exchange.) But this is the only justification for being able to use a smaller sample size than 23 - not the fact that you would abort the confidence interval calculation if you found anything other than 100% for your sample's estimate of the true value of $p$.
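For what it's worth, a first-order adjustment that textbooks commonly quote is the finite population correction $n' = n / (1 + (n - 1)/N)$; I'm offering it as a rough guide, not the exact calculation alluded to above:

```python
import math

def fpc_adjusted_sample_size(n_infinite, population_size):
    """Shrink a sample size computed for an 'infinite' population using
    the textbook finite population correction. A common approximation,
    not an exact small-population calculation."""
    return math.ceil(n_infinite / (1 + (n_infinite - 1) / population_size))

# Hypothetical: 23 needed for a huge population, but only 50 files exist.
print(fpc_adjusted_sample_size(23, 50))  # 16
```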
The notion of confidence interval is somewhat intuitive, but that may be keeping you from understanding what it means in more depth.
Say I have multiple samples $x_i$ from a population, and I wish to estimate the population mean $\mu$. A CI of, say, 95% represents an interval of possible values of $\mu$ such that, given my samples, the "probability" that $\mu$ lies in that interval is 95%.
We immediately see that there can be more than one such interval, since I could trade probability past the upper end for probability at the lower end of the interval, thus shifting the interval. Let's skirt that issue by demanding a symmetric interval about my sample mean.
But the "probability" is not well defined from the information I just presented!
In order to assign a probability, I have to make some assumptions about the population. The usual assumption is that the population variance is equal to the unbiased estimator of variance obtained from our sample. But we still have things backward: We can't honestly talk about the probability of the population mean being in some range, without any assumption about the a priori (before I saw my samples) probabilities of the mean being various values.
So we apply the usual sleight-of-mind logic employed by the frequentist point of view. We ask:
Given that the population variance is our unbiased sample variance estimate, what are the highest and lowest values of the population mean $\mu$ such that the chance of our sample being as far away from $\mu$ as it is, is at least 100% - 95% = 5%?
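In code, that recipe is just the familiar $\bar{x} \pm z\,s/\sqrt{n}$ interval. Here's a minimal sketch, assuming a 95% level and treating the unbiased sample variance as if it were the population variance:

```python
import statistics

def symmetric_ci(samples, z=1.96):
    """95% CI centered on the sample mean, taking the unbiased sample
    variance as the population variance."""
    n = len(samples)
    mean = statistics.fmean(samples)
    s = statistics.stdev(samples)  # square root of the unbiased variance
                                   # estimate (n - 1 denominator)
    half = z * s / n ** 0.5
    return mean - half, mean + half

print(symmetric_ci([4.1, 5.0, 4.7, 5.3, 4.9, 5.1]))
```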
Now let's go back to your problem. Since the population is finite, as you draw more samples (without replacement) you actually do learn something about the population. If you had drawn all the objects but one, and you take your unbiased sample variance as the population variance, then your 95% confidence interval for the value of that remaining object would be roughly $\pm 2\sigma$, but your estimate of the population mean would have a standard deviation of roughly $\sigma/N$. This is quite a bit smaller than it would be for an infinite population or a small sampling of a large population.
Now when you draw that last sample, you know everything about the distribution. In particular, you know the mean exactly. Therefore any interval that includes the actual mean is a 100% CI. If you then say that the real CI is the tightest such interval, then it has width zero.
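To see the interval collapse numerically, here's a sketch using the textbook finite-population correction to the standard error, a factor of $\sqrt{(N-n)/(N-1)}$ (my choice of formula; the argument above doesn't depend on it):

```python
def fpc_standard_error(s, n, N):
    """Standard error of the sample mean when drawing n of N items
    without replacement, with the finite-population correction."""
    return (s / n ** 0.5) * ((N - n) / (N - 1)) ** 0.5

s, N = 1.0, 100
for n in (10, 50, 99, 100):
    print(n, round(fpc_standard_error(s, n, N), 4))
# At n = N - 1 this is about s/N (matching the sigma/N claim above), and
# at n = N it is exactly zero: the mean is known and the interval collapses.
```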
Best Answer
Generally the sample mean is the midpoint of the confidence interval, so you can recover it as the average of the two endpoints: $\bar{x} = (L + U)/2$.
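In code (assuming a symmetric $z$-based interval and hypothetical endpoint values): the mean is the average of the endpoints, and the standard deviation you're missing can be backed out from the half-width, since half-width $= z\,s/\sqrt{n}$.

```python
import math

def mean_and_sd_from_ci(lower, upper, n, z=1.96):
    """Recover the sample mean (midpoint) from a symmetric interval, and
    the implied sample standard deviation from mean +/- z*s/sqrt(n)."""
    mean = (lower + upper) / 2
    s = (upper - lower) / 2 * math.sqrt(n) / z
    return mean, s

# Hypothetical example: a 95% CI of (12.0, 18.0) from a sample of size 30.
print(mean_and_sd_from_ci(12.0, 18.0, 30))  # (15.0, ~8.38)
```

If the interval came from a $t$ distribution instead, replace $z$ with the $t$ critical value for $n - 1$ degrees of freedom; the midpoint calculation is unchanged.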