What's the simplest formula to calculate the optimal sample size from a population size? I've been reading a little about it, and the formulas ask for things like confidence level or confidence interval, which I know nothing about. I need to test some HTML elements in web pages to get a better idea of the general properties of HTML elements, and I have no idea about the results I'll get. So I just need the most neutral possible version of the sample size formula when all I know is the population size.
[Math] Optimal sample size
statistics
Related Solutions
It's not surprising you're a bit confused; understanding what's really going on with confidence intervals can be tricky.
The short version: If you don't want to check all the files, you have to choose two different percentages: the confidence level (95% in your example), and how far off you're willing to be at that level (20% in your example). These percentages refer to two different quantities, so it doesn't make sense to add or subtract them from each other. Once you've made these choices, I think it is fine to use the online calculator to get a sample size.
If you want more detail on what's going on, here's the explanation: You're trying to estimate the true percentage of files that have correct data. Let's call that percentage $p$. Since you don't want to calculate $p$ exactly, you have to choose how far off you are willing to be with your estimate, say 20%. Unfortunately, you can't even be certain that your estimate of $p$ will be within 20%, so you have to choose a level of confidence that that estimate will be within 20% of $p$. You have chosen 95%. Then the online calculator gives you the sample size of 23 you need to estimate $p$ to within 20% at 95% confidence.
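As a rough illustration of the kind of calculation such a calculator performs, here is a minimal Python sketch of the standard large-population formula for a proportion, $n = z_{\alpha/2}^2 \, p(1-p) / E^2$. The function name and the worst-case default $p = 0.5$ are assumptions made for illustration; they are not taken from the calculator itself.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(confidence, margin, p=0.5):
    """Large-population sample size for estimating a proportion.

    confidence: confidence level as a fraction, e.g. 0.95
    margin:     acceptable error as a fraction, e.g. 0.20
    p:          assumed true proportion (0.5 is the most conservative guess)
    """
    alpha = 1.0 - confidence
    # Two-sided critical value z_{alpha/2} of the standard normal distribution
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    n = (z ** 2) * p * (1.0 - p) / margin ** 2
    # Always round up: a fractional observation is not possible
    return math.ceil(n)


print(sample_size_for_proportion(0.95, 0.20))  # 25
```

For 95% confidence and a 20% margin this gives 25 under the worst-case assumption. A calculator that also takes the population size into account can report a slightly smaller figure, which may be why the number quoted above is 23.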
But what does that 95% really mean? Basically, it means that if you were to choose lots and lots of samples of size 23 and calculate a confidence interval from each one, 95% of the resulting confidence intervals would actually contain the unknown value of $p$. The other 5% would give an interval of some kind that does not include $p$. (Some would be too large, others would be too small.) Another way to look at it is that choosing a 95% confidence interval means that you're choosing a method that gives correct results (i.e., produces a confidence interval that actually contains the value of $p$) 95% of the time.
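The "lots and lots of samples" idea can be checked with a small simulation. The sketch below is only an illustration: the function name, the true proportion, the sample size, and the number of trials are all hypothetical choices, and it uses the simple normal-approximation interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$, which reaches the nominal 95% only approximately (and less well for very small samples).

```python
import math
import random


def coverage_rate(true_p, sample_size, trials, z=1.96, seed=0):
    """Fraction of simulated confidence intervals that contain true_p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # Draw one sample and count the "successes" in it
        successes = sum(rng.random() < true_p for _ in range(sample_size))
        p_hat = successes / sample_size
        # Half-width of the normal-approximation interval
        half_width = z * math.sqrt(p_hat * (1 - p_hat) / sample_size)
        if p_hat - half_width <= true_p <= p_hat + half_width:
            hits += 1
    return hits / trials


# Hypothetical setup: true proportion 0.5, samples of size 100
print(coverage_rate(true_p=0.5, sample_size=100, trials=2000))
```

The printed fraction should land close to 0.95: most of the intervals contain the true value, and a small share miss it.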
To answer your specific questions:
"Does that mean that 'I can be 95% confident that 80% to 100% of the files are correct'?" Not precisely. It really means that you can be 95% confident that the true percentage of correct files is between 80% and 100%. That's a subtle distinction.
"And only then I can say with 95% confidence that the files are correct? (99% +- 4% = 95% to 100%)?" No, this is confusing the two kinds of percentages. The 99% refers to 99% of all confidence intervals constructed if you were to construct lots of them. The 4% refers to an error margin of $\pm$ 4% for the files.
One other thing to remember is that the sample size estimator assumes that the population you're drawing from is much, much larger than the size of the sample you end up going with. Since your population is fairly small you can get away with a smaller-sized sample with the same level of confidence. The determination of exactly how small, though, is a much more difficult calculation. It's beyond what you would have seen in a basic statistics class. I'm not sure how to do it; maybe someone else on the site does. (EDIT: Even better: take Jyotirmoy Bhattacharya's suggestion and ask on Stats Stack Exchange.) But this is the only justification for being able to use a smaller sample size than 23 - not the fact that you would abort the confidence interval calculation if you found anything other than 100% for your sample's estimate of the true value of $p$.
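One commonly used approximation for this small-population adjustment is the finite population correction, $n = n_0 / \bigl(1 + (n_0 - 1)/N\bigr)$, where $n_0$ is the large-population sample size and $N$ is the population size. This is an addition for illustration, not something the answer above derives; the function name and the population of 100 in the example are hypothetical.

```python
import math


def finite_population_sample_size(n_infinite, population_size):
    """Shrink a large-population sample size for a finite population.

    n_infinite:      sample size computed as if the population were unlimited
    population_size: actual number of items in the population (N)
    """
    n = n_infinite / (1.0 + (n_infinite - 1.0) / population_size)
    return math.ceil(n)


# n_0 = 1.96^2 * 0.25 / 0.20^2 = 24.01 (95% confidence, 20% margin, p = 0.5)
print(finite_population_sample_size(24.01, 100))  # 20
```

As the population grows, the correction fades away and the result approaches the uncorrected value.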
Is your formula the same as this one:
$$ n = \left( \frac{z_{\alpha/2}\sigma }{E} \right)^2 $$
with $z_{\alpha/2} = 1.96$ for a $95\%$ confidence interval and $E=2.5$ as your margin of error?
If you think that the question has given you the sample variance ($s^2=25$), then we can estimate the population variance ($\sigma^2$) using:
$$ \sigma^2 = \frac{n_0}{n_0-1}s^2 $$
where $n_0=10$ in your case.
(You might have come across this estimation before. If not, try Googling it: "estimating the population variance from the sample variance".)
Using this estimation you get $n=17.07$, rounded up to $n=18$. Where did you get the answer $20$ from? It seems wrong to me...
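The arithmetic can be reproduced with a few lines of Python. This is just a check of the numbers quoted in this answer ($z_{\alpha/2} = 1.96$, $E = 2.5$, $s^2 = 25$, $n_0 = 10$); the function name is made up for illustration.

```python
import math


def required_n(z, sample_variance, pilot_size, margin):
    """Sample size from n = (z * sigma / E)^2 with sigma^2 = n0/(n0-1) * s^2."""
    # Estimate the population variance from the sample variance
    sigma2 = pilot_size / (pilot_size - 1) * sample_variance
    return (z * math.sqrt(sigma2) / margin) ** 2


n = required_n(z=1.96, sample_variance=25.0, pilot_size=10, margin=2.5)
print(round(n, 2), math.ceil(n))  # 17.07 18
```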
Best Answer
It depends on your population and the confidence that you need in your results. Typically, you should know something about what you are trying to achieve. If you know your confidence level, confidence interval, and population size, you can choose a sample size that will give you the proper confidence.
Your confidence level, typically 90%, 95%, or 99%, tells you how sure you can be that your sample results reflect the entire population. The confidence interval describes the range of values that you expect to contain the true value at that confidence level.
Wikipedia provides an overview of sample size methodologies. That might get you started. But unless you know how sure you want to be of your results, you can't determine a sample size. Wikipedia also provides decent definitions of confidence level and confidence interval.
From a statistical standpoint, if you don't have clearly defined questions, you really can't analyze your data, at least using statistical methods. Perhaps you should review data and try to formulate questions, or take an analytical instead of statistical approach to solving the problem/answering your question.
Assuming a normal distribution, you can use the equation $n \geq \left(\dfrac{z_{\alpha/2} \sigma}{\delta}\right)^2$ where $z_{\alpha/2}$ is found in a table, $\sigma$ is your standard deviation (either a sample standard deviation or a population standard deviation, depending on what you know), and $\delta$ is your margin of error, i.e. the half-width of the confidence interval you are willing to accept. The value for $n$ is your sample size; be sure to round up if you get a fractional $n$.
Note that the z-value is based on your confidence level. The value for $\alpha$ is 1 minus the decimal form of your confidence level (for a confidence level of 95%, $\alpha$ would be 1 - 0.95, or 0.05).
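Instead of looking up $z_{\alpha/2}$ in a table, it can be computed from the confidence level. The following is a minimal Python sketch under the same normality assumption, treating $\delta$ as the acceptable margin of error; the function names and the example values ($\sigma = 5$, $\delta = 1$) are hypothetical.

```python
import math
from statistics import NormalDist


def z_value(confidence):
    """Two-sided critical value z_{alpha/2} for a confidence level such as 0.95."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def sample_size_for_mean(confidence, sigma, delta):
    """Smallest integer n with n >= (z_{alpha/2} * sigma / delta)^2."""
    z = z_value(confidence)
    return math.ceil((z * sigma / delta) ** 2)


print(round(z_value(0.95), 2))                           # 1.96
print(sample_size_for_mean(0.95, sigma=5.0, delta=1.0))  # 97
```

Raising the confidence level or shrinking $\delta$ increases the required $n$, which is why both choices have to be made before a sample size can be computed.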
You might also be interested in the answers to a few questions on the Statistical Analysis Stack Exchange: