Calculating a sample size based on a confidence level

probability, statistics

It's been a while since my last statistics class…

I have $404$ files that went through some automated generation process. I would like to manually verify some of them to make sure that their data is indeed correct. I want to use probability to help me out so that I don't need to check every single file.

How would I calculate what sample size I should use to reach a certain confidence level?

For example, if I would like to say with $95\%$ confidence that the files are correct, how many of them do I have to check?

I found an online calculator, but I'm not entirely sure what I should put for the confidence interval. Say I put $20\%$ and leave the confidence factor at $95\%$. I get a sample size of $23$. Let's say now that I tested $23$ random files and all of them were fine. Does that mean that "I can be $95\%$ confident that $80\%$ to $100\%$ of the files are correct"?

Does this mean, then, that for my original question I would need to use a $99\%$ confidence level with a $4\%$ confidence interval, verify that all $291$ files (the sample size it gave me) are correct, and only then say with $95\%$ confidence that the files are correct? ($99\% \pm 4\% = 95\%$ to $100\%$)

It also mentions something about percentages that I'm not quite clear on. Does the fact that most (i.e., $100\%$) of the files I test are valid (since if I found an invalid one, I would stop the whole process and examine my generation process for errors) mean that I can use a smaller sample to get the same confidence factor? If so, how would I calculate it?

Best Answer

It's not surprising you're a bit confused; understanding what's really going on with confidence intervals can be tricky.

The short version: If you don't want to check all the files, you have to choose two different percentages: the confidence level ($95\%$ in your example) and how far off you're willing to be at that level ($20\%$ in your example). These percentages refer to two different quantities, so it doesn't make sense to add or subtract one from the other. Once you've made these choices, I think it is fine to use the online calculator to get a sample size.

If you want more detail on what's going on, here's the explanation: You're trying to estimate the true percentage of files that have correct data. Let's call that percentage $p$. Since you don't want to determine $p$ exactly (that would mean checking every file), you have to choose how far off you are willing to be with your estimate, say $20\%$. Unfortunately, you can't even be certain that your estimate will be within $20\%$ of $p$, so you also have to choose a level of confidence that the estimate will be that close. You have chosen $95\%$. The online calculator then gives you the sample size of $23$ you need to estimate $p$ to within $20\%$ at $95\%$ confidence.
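
In case it helps to see the arithmetic, the usual normal-approximation recipe behind such calculators is $n = z^2\,p(1-p)/E^2$, where $E$ is the margin of error and $z$ is the normal quantile for the chosen confidence level, with the worst case $p = 0.5$ assumed when nothing is known in advance. Here is a minimal sketch in Python; it is an assumption that the asker's calculator works exactly this way:

```python
from math import ceil
from statistics import NormalDist

def sample_size(confidence, margin, p=0.5):
    """Normal-approximation sample size for estimating a proportion.

    p = 0.5 is the worst case, used when nothing is known in advance.
    A sketch of the usual textbook formula, not necessarily the exact
    method behind the asker's online calculator.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # ~1.96 for 95%
    return ceil(z**2 * p * (1 - p) / margin**2)

print(sample_size(0.95, 0.20))  # 25; the calculator's 23 likely also
                                # adjusts for the population of 404
                                # (see the end of this answer)
```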

But what does that $95\%$ really mean? Basically, it means that if you were to choose lots and lots of samples of size $23$ and calculate a confidence interval from each one, $95\%$ of the resulting confidence intervals would actually contain the unknown value of $p$. The other $5\%$ would produce intervals that miss $p$: some would lie entirely above it, others entirely below. Another way to look at it is that choosing a $95\%$ confidence interval means choosing a method that gives correct results (i.e., produces a confidence interval that actually contains the value of $p$) $95\%$ of the time.
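
To make that frequency interpretation concrete, here is a hypothetical simulation sketch: pick some true $p$, draw many samples of size $23$, build a simple normal-approximation (Wald) interval from each, and count how often the interval contains $p$. The choice of $p = 0.9$ and the Wald interval are assumptions for illustration only:

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)
p_true = 0.9   # hypothetical true fraction of correct files
n = 23         # sample size from the calculator
z = NormalDist().inv_cdf(0.975)  # ~1.96 for 95% confidence

trials, hits = 10_000, 0
for _ in range(trials):
    # Each sampled file is correct with probability p_true.
    correct = sum(random.random() < p_true for _ in range(n))
    p_hat = correct / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)  # Wald interval half-width
    if p_hat - half <= p_true <= p_hat + half:
        hits += 1

# Prints roughly 0.9: a bit under the nominal 0.95, because the simple
# Wald interval is known to under-cover at small n. The point is the
# reading: "about 95% of such intervals should contain p".
print(hits / trials)
```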

To answer your specific questions:

"Does that mean that 'I can be 95% confident that 80% to 100% of the files are correct'?" Not precisely. It really means that you can be 95% confident that the true percentage of correct files is between 80% and 100%. That's a subtle distinction.

"And only then I can say with 95% confidence that the files are correct? (99% +- 4% = 95% to 100%)?" No, this is confusing the two kinds of percentages. The 99% refers to 99% of all confidence intervals constructed if you were to construct lots of them. The 4% refers to an error margin of $\pm$ 4% for the files.

One other thing to remember is that the sample size estimator assumes that the population you're drawing from is much, much larger than the size of the sample you end up going with. Since your population is fairly small, you can get away with a smaller sample at the same level of confidence. The determination of exactly how small, though, is a much more difficult calculation. It's beyond what you would have seen in a basic statistics class. I'm not sure how to do it; maybe someone else on the site does. (EDIT: Even better: take Jyotirmoy Bhattacharya's suggestion and ask on Stats Stack Exchange.) But this is the only justification for being able to use a smaller sample size than $23$, not the fact that you would abort the confidence interval calculation if you found anything other than $100\%$ for your sample's estimate of the true value of $p$.
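
For what it's worth, one common textbook adjustment for a small population is the finite population correction, which rescales the infinite-population sample size $n_0$ to $n_0 / (1 + (n_0 - 1)/N)$ under simple random sampling. The sketch below assumes that adjustment; it is only a guess about what the asker's calculator does, but it happens to reproduce both the $23$ and the $291$, which suggests the calculator may already be applying it for a population of $404$:

```python
from math import ceil
from statistics import NormalDist

def sample_size_fpc(confidence, margin, N, p=0.5):
    """Sample size with the standard finite population correction,
    assuming simple random sampling from a population of N items.
    A textbook sketch, not necessarily the calculator's exact method.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n0 = z**2 * p * (1 - p) / margin**2   # infinite-population size
    return ceil(n0 / (1 + (n0 - 1) / N))  # shrink for finite N

print(sample_size_fpc(0.95, 0.20, 404))  # 23  -- matches the calculator
print(sample_size_fpc(0.99, 0.04, 404))  # 291 -- matches the calculator
```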