Sample Size – How to Calculate Sample Size to Ensure Confidence in Sample Mean

confidence-interval, sample-size

Unfortunately, it's a long while since I did statistics and despite reading & research I'm not 'confident' as to how to calculate this correctly.

I would like to know the smallest sample size required in order to have a given confidence level that the sample mean would be within a given % of the population mean.

Whilst arbitrary, the following would be known (can be calculated):

population size
population mean
population standard deviation

Sampling would be without replacement.

Distribution is normal.

For example, for a population of 1,000,000 with a mean of 0.90 and a population standard deviation of 1.32 I would need a sample n to be 99% confident that the sample mean is within 1% of the population mean.

I'm interested in understanding the formula as I have to solve this many times for different populations, different confidence levels, and different margins of error. Thank you.

Best Answer

For example, for a population of 1,000,000 with a mean of 0.90 and a population standard deviation of 1.32 I would need a sample n to be 99% confident that the sample mean is within 1% of the population mean.

Okay.

Sampling would be without replacement.

With a million in the population?

~~To a first approximation, it doesn't matter enough to be worth worrying about~~

Actually, turns out in this case it does. I'll do it both without replacement and with. With replacement is simpler, and I do it first.

Distribution is normal.

Don't need it. The sample size will be large enough that with the other assumptions, only really strongly non-normal distributions will have any impact.

Can we assume independence (apart from the effect of sampling without replacement)? e.g. sampling completely at random? I'll take it that we can.

$\mu = 0.90$

$\sigma = 1.32$

Want 'to be 99% confident that the sample mean is within 1% of the population mean'.

i.e. Find $n$ such that $P(|\bar{x}-\mu| < .01\mu) = 0.99$

$\bar{x}-\mu \sim N(0, \frac{\sigma^2}{n})$

99% of a normal distribution is within 2.576 s.d.'s of its mean (this figure can be read from normal tables or computed with a function in a program; I used R).

Thus I need $2.576 \times \sigma/\sqrt{n} < 0.01 \mu = 0.009$

Hence $2.576^2 \sigma^2/n < 0.009^2$

Hence $2.576^2 \sigma^2 < n \times 0.009^2$

Or $n > (2.576 \times 1.32/0.009)^2 = 142742.9$

So if $n$ is about 142700, (the means and sd's and normal table values were only accurate to about the same number of figures - only the first 3-4 digits will be meaningful) then the required probability statement should hold.
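As a rough sketch, the with-replacement calculation above can be reproduced in R. The variable names (`conf`, `rel`, `E`, `n0`) are my own labels, not part of the question. Because `qnorm` returns the unrounded critical value (about 2.5758 rather than 2.576), the result differs slightly from 142742.9 but agrees to the digits that are meaningful.

```r
# Inputs taken from the question
mu    <- 0.90    # population mean
sigma <- 1.32    # population standard deviation
conf  <- 0.99    # confidence level
rel   <- 0.01    # margin of error as a fraction of the mean

z <- qnorm(1 - (1 - conf) / 2)   # two-sided normal critical value, about 2.576
E <- rel * mu                    # absolute margin of error, 0.009

n0 <- (z * sigma / E)^2          # required n when sampling with replacement
n0                               # roughly 142700
```

Changing `conf`, `rel`, `mu` or `sigma` gives the corresponding figure for other populations and tolerances.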

If we allow for the 'without replacement', the sample size would reduce by about 12.5% (search for the finite population correction to the variance); other factors are likely to affect you by more than a couple of percent (not having perfectly random sampling, for one example).

Let's look at the without replacement case using the finite population correction now.

The finite population correction multiplies the variance by a factor $f = \frac{N-n}{N-1} = 1-\frac{n-1}{N-1}$.

Some people approximate this by $1 -\, n/N$, which is easily accurate enough with the large numbers for $n$ and $N$ involved here. However, I'll try to do the first version here.

$2.576^2 \sigma^2 (N-n)/(N-1) < n \times 0.009^2$

$(2.576\sigma/0.009)^2 /(N-1) < n/(N-n) $

$(2.576\sigma/0.009)^2 /(N-1) < 1/[N/n\,\,\, -1] $
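One way to fill in the algebra leading to the next line, writing $n_0 = (2.576\sigma/0.009)^2 \approx 142743$ (a symbol introduced here only for brevity) and using $0 < n < N$:

$$\frac{n_0}{N-1} < \frac{n}{N-n} \;\iff\; n_0 (N-n) < n (N-1) \;\iff\; n_0 N < n\,(N - 1 + n_0) \;\iff\; n > \frac{n_0 N}{N - 1 + n_0}.$$

With $N = 1000000$ the denominator is $999999 + 142743 = 1142742$.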

$142743 \times 1000000/1142742 < n$

So (if I did that right), $n > 124912.7$

Or to the accuracy in the normal value, $n$ should be about $124900$.

(assuming the mean and s.d. are actually accurate to at least 4 figures, too)
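Since the question asks for something reusable across populations, confidence levels and margins, here is a minimal sketch of the whole calculation as an R function. The name `sample_size_mean` and its arguments are hypothetical, chosen for illustration. It assumes simple random sampling without replacement and a normal approximation for the sample mean, and it uses the closed form $n > n_0 N/(N-1+n_0)$, which follows from rearranging the inequality above with $n_0$ the with-replacement sample size.

```r
# Hypothetical helper: smallest n such that the sample mean is within
# rel * |mu| of mu with probability conf, sampling without replacement
# from a population of N units.
sample_size_mean <- function(N, mu, sigma, conf = 0.99, rel = 0.01) {
  z  <- qnorm(1 - (1 - conf) / 2)   # two-sided normal critical value
  E  <- rel * abs(mu)               # absolute margin of error
  n0 <- (z * sigma / E)^2           # with-replacement (infinite population) size
  n  <- n0 * N / (N - 1 + n0)       # apply the exact finite population correction
  ceiling(n)                        # round up to a whole number of units
}

# The example from the question; gives a value close to 124900
sample_size_mean(N = 1e6, mu = 0.90, sigma = 1.32)
```

For very large `N` the correction becomes negligible and the result approaches the with-replacement figure.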

Calculation check:

Interval half-width =

$(2.576\times 1.32/\sqrt{124900})\sqrt{(1000000-124900)/999999}$

$= 0.00900$
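The same check can be run in R. This sketch uses the rounded critical value 2.576 from the answer, and the variable names are mine.

```r
N     <- 1e6       # population size
n     <- 124900    # proposed sample size
sigma <- 1.32      # population standard deviation
z     <- 2.576     # rounded 99% two-sided critical value used in the answer

# Standard error of the sample mean with the finite population correction
se <- sigma / sqrt(n) * sqrt((N - n) / (N - 1))

half_width <- z * se
half_width         # should be very close to the target 0.009
```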

Related Solutions

Solved – Calculate Mean and Standard Deviation, when given the confidence interval and sample size

Obviously you will need to know the type of confidence interval you are dealing with, but let's suppose that this is a standard one-sample confidence interval for the mean, using the standard T-statistic as the pivotal quantity. In that case, the formula for the interval is:

$$\text{CI}(1-\alpha) = \Bigg[ \bar{x} \pm \frac{t_{n-1, \alpha/2}}{\sqrt{n}} \cdot s \Bigg].$$

Thus, if we denote the known lower and upper bounds of the interval as $l$ and $u$ respectively, then you can algebraically reverse-engineer the sample mean and sample standard deviation as:

$$\bar{x} = \frac{l+u}{2} \quad \quad \quad \quad \quad s = \frac{u-l}{2} \cdot \frac{\sqrt{n}}{t_{n-1, \alpha/2}}.$$

With the values specified in your example, you get:

```r
# Set preliminary values
l     <- 5.18
u     <- 5.38
n     <- 300
alpha <- 0.05

# Compute sample mean and SD
crit <- qt(alpha/2, df = n-1, lower.tail = FALSE)
MEAN <- (l+u)/2
SD   <- (u-l)*sqrt(n)/(2*crit)

# Print the values
MEAN
#> [1] 5.28
SD
#> [1] 0.8801386
```

Thus, assuming that your interval was a standard one-sample confidence interval, you must have had a sample mean $\bar{x} = 5.28$ and sample standard deviation $s = 0.88$.
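As a sanity check, the recovered values can be plugged back into the interval formula to see whether they reproduce the original bounds. This is a small sketch under the same assumption of a standard one-sample t interval; the names `xbar`, `s` and `ci` are my own.

```r
# Values recovered in the answer above
n     <- 300
alpha <- 0.05
xbar  <- 5.28
s     <- 0.8801386

crit <- qt(1 - alpha / 2, df = n - 1)          # upper alpha/2 critical value
ci   <- xbar + c(-1, 1) * crit * s / sqrt(n)   # rebuild the interval
round(ci, 2)                                   # should return the bounds 5.18 and 5.38
```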

Confidence Intervals – Estimating Mean Confidence Intervals for a Sample with Known Population Standard Deviation

Suppose you have 150 locations altogether, and you decide to base your confidence interval for the mean of the population (for some attribute) on a sample of size 10.

```r
whole <- rnorm(150, 50, 7)   # simulated population of 150 locations
x <- sample(whole, 10)       # random sample of 10, without replacement
summary(x);  length(x);  sd(x)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   37.86   43.03   45.24   47.92   52.61   59.93 
#> [1] 10         # sample size
#> [1] 7.470816   # sample standard deviation

t.test(x)$conf.int
#> [1] 42.57347 53.26207
#> attr(,"conf.level")
#> [1] 0.95
```

The mean for the whole company is 50; a 95% confidence interval for the mean is $(42.6, 53.3).$ I used the `t.test` procedure in R, but the 95% CI can be found from the formula $\bar X \pm t^* S/\sqrt{n},$ where $t^* = 2.262$ cuts probability 2.5% from the upper tail of Student's t distribution with $\nu = n-1 = 9$ degrees of freedom:

```r
qt(.975, 9)
#> [1] 2.262157

mean(x) + qt(c(.025,.975),9)*sd(x)/sqrt(10)
#> [1] 42.57347 53.26207
```

If you knew the population standard deviation $\sigma=7,$ then you could use $\bar X \pm 1.96(7/\sqrt{10}),$ which computes to $(43.6, 52.3).$ In general, this method has the potential to be a little more accurate; for this example it gives a somewhat narrower interval than the t-based CI above.

```r
mean(x) + qnorm(c(.025,.975))*7/sqrt(10)
#> [1] 43.57920 52.25633
```

Notes: (1) You are sampling from a finite population of size 150. As long as the sample size (here $n=10)$ is less than 10% of the population size, these formulas for sampling from essentially infinite populations should give useful results.

(2) These methods assume that the population values are approximately normally distributed. These methods would not work well if you had a few locations that are hugely different from any of the others.

(3) Your idea of doing some sort of stratified sampling, so that several provinces are represented or so that some observations are from urban and some from rural locations, might be useful. That would depend on whether there are large differences among provinces or between rural and urban locations. Stratified sampling would make it somewhat more difficult to make a confidence interval.

(4) Here, because I simulated the whole population, we can find the exact population mean and standard deviation and we know that the data are normal. In most actual applications this information would not necessarily be known.

(5) If you have some data for all 100+ scores, you might try `t.test` on a sample of a dozen or so locations to see how well it works in your application.
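The suggestion in note (5) could be tried along the following lines. This is only a sketch on simulated data like that used above; with real data, `whole` would be replaced by the actual scores for all locations. The seed, the sample size of 12 and the 2000 repetitions are arbitrary choices of mine.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
whole <- rnorm(150, 50, 7)         # simulated population, as in the answer
mu    <- mean(whole)               # exact mean of this finite population

covers <- replicate(2000, {
  x  <- sample(whole, 12)          # a dozen locations, without replacement
  ci <- t.test(x)$conf.int         # 95% t interval from this sample
  ci[1] <= mu && mu <= ci[2]       # does the interval contain the true mean?
})

mean(covers)                       # proportion of intervals that cover mu
```

If the method suits the data, the proportion should be near the nominal 95%; a much lower value would suggest the normality assumption in note (2) is doing poorly.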
