Solved – Why does increasing the sample size lower the (sampling) variance

sampling, statistical-power, variance

Big picture:

I'm trying to understand how increasing the sample size increases the power of an experiment.
My lecturer's slides explain this with a picture of two normal distributions, one for the null hypothesis and one for the alternative hypothesis, with a decision threshold $c$ between them.
They argue that increasing the sample size lowers the variance of these sampling distributions, making the curves narrower and more sharply peaked, which reduces the overlapping area under the curves and hence the probability of a type II error.

Small picture:

I don't understand how a bigger sample size will lower the variance.
I assume you just calculate the sample variance and use it as the variance parameter of a normal distribution.

I tried:

  • googling, but most accepted answers have 0 upvotes or are merely examples
  • thinking: By the law of large numbers, each observation should eventually stabilize around its expected value under the normal distribution we assume, so the sample variance should converge to the variance of that normal distribution. But what is the variance of that distribution, and is it a minimum value, i.e. can we be sure our sample variance decreases to that value?

Best Answer

Standard deviations of averages are smaller than standard deviations of individual observations. [Here I will assume independent identically distributed observations with finite population variance; something similar can be said if you relax the first two conditions.]

It's a consequence of the simple fact that the standard deviation of the sum of two random variables is no larger than the sum of their standard deviations (they are equal only when the two variables are perfectly positively correlated).

In fact, when you're dealing with uncorrelated random variables, we can say something more specific: the variance of a sum of variates is the sum of their variances.

This means that with $n$ independent (or even just uncorrelated) variates with the same distribution, the variance of the mean is the variance of an individual divided by the sample size.
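To make that step explicit, here is the one-line derivation (assuming, as above, uncorrelated observations $X_1,\dots,X_n$ with common variance $\sigma^2$):

$$\operatorname{Var}(\bar{X})=\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n}X_i\right)=\frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)=\frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n}.$$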

Correspondingly with $n$ independent (or even just uncorrelated) variates with the same distribution, the standard deviation of their mean is the standard deviation of an individual divided by the square root of the sample size:

$\sigma_{\bar{X}}=\sigma/\sqrt{n}$.
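If you want to see this without any algebra, here is a minimal simulation sketch (the particular numbers are just illustrative) that draws repeated samples and compares the empirical standard deviation of the sample mean with $\sigma/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 2.0        # population standard deviation (illustrative choice)
n_reps = 100_000   # number of simulated samples per sample size

for n in [4, 16, 64, 256]:
    # draw n_reps samples of size n and compute each sample's mean
    means = rng.normal(loc=0.0, scale=sigma, size=(n_reps, n)).mean(axis=1)
    print(f"n={n:4d}  empirical sd of mean = {means.std(ddof=1):.4f}  "
          f"sigma/sqrt(n) = {sigma / np.sqrt(n):.4f}")
```

Each empirical value should sit close to the theoretical $\sigma/\sqrt{n}$, halving every time $n$ is quadrupled.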

So as you add more data, you get increasingly precise estimates of group means. A similar effect applies in regression problems.

Since we can get more precise estimates of averages by increasing the sample size, we are more easily able to tell apart means which are close together -- even though the distributions overlap quite a bit, by taking a large sample size we can still estimate their population means accurately enough to tell that they're not the same.
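To tie this back to power, here is a small simulation sketch under assumed values (null mean $0$, true mean $0.5$, known $\sigma = 2$, one-sided $z$-test at $\alpha = 0.05$; all of these numbers are made up for illustration). It estimates how often the test correctly rejects the null as $n$ grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

mu0, mu1, sigma = 0.0, 0.5, 2.0   # null mean, true mean, known sd (illustrative)
alpha = 0.05
crit = stats.norm.ppf(1 - alpha)  # one-sided critical value
n_reps = 20_000

for n in [10, 50, 200, 800]:
    # simulate n_reps experiments whose data actually come from mu1
    means = rng.normal(mu1, sigma, size=(n_reps, n)).mean(axis=1)
    z = (means - mu0) / (sigma / np.sqrt(n))
    print(f"n={n:4d}  simulated power = {np.mean(z > crit):.3f}")
```

The overlap between the two individual-observation distributions never changes, but the sampling distributions of the mean tighten, so the estimated power climbs towards 1.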