Solved – Why is the sum of squared differences from the sample mean smaller than the sum of squared differences from the true mean

variance

So the description I read is:

It is important to note that the formulas for calculating the variance
and the standard deviation differ depending on whether you are working
with a distribution of scores taken from a sample or from a
population. The reason these two formulas are different is quite
complex and requires more space than allowed in a short book like
this. I provide an overly brief explanation here and then encourage
you to find a more thorough explanation in a traditional statistics
textbook. Briefly, when we do not know the population mean, we must
use the sample mean as an estimate. But the sample mean will probably
differ from the population mean. Whenever we use a number other than
the actual mean to calculate the variance, we will end up with a
larger variance, and therefore a larger standard deviation, than if we
had used the actual mean. This will be true regardless of whether the
number we use in our formula is smaller or larger than our actual
mean. Because the sample mean usually differs from the population
mean, the variance and standard deviation that we calculate using the
sample mean will probably be smaller than it would have been had we
used the population mean. Therefore, when we use the sample mean to
generate an estimate of the population variance or standard deviation,
we will actually underestimate the size of the true variance in the
population because if we had used the population mean in place of the
sample mean, we would have created a larger sum of squared deviations,
and a larger variance and standard deviation. To adjust for this
underestimation, we use n – 1 in the denominator of our sample
formulas. Smaller denominators produce larger overall variance and
standard deviation statistics, which will be more accurate estimates
of the population parameters.

I understood none of this, as it seems contradictory. First it says that using a number other than the actual mean (to me, e.g. the sample mean, which will differ from the population mean) produces a larger variance. But in the middle it says the variance and standard deviation calculated with the sample mean will be smaller than if the population mean had been used… Can you please explain which it is exactly, and why?

Best Answer

This just seems like a sloppy choice of words. The sentence

Whenever we use a number other than the actual mean to calculate the variance, we will end up with a larger variance

seems to use the term 'actual mean' to refer to the sample mean. I think this is where your confusion comes from. This is further evidenced by the sentence

the variance and standard deviation that we calculate using the sample mean will probably be smaller than it would have been had we used the population mean.

He apparently uses this intuitive motivation to show why the sample variance is biased downward, when he says

Therefore, when we use the sample mean to generate an estimate of the population variance or standard deviation, we will actually underestimate the size of the true variance

To see why this is true, define a function

$$ f(c) = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - c)^2 $$

Using basic calculus, $f'(c) = -\frac{2}{n} \sum_{i=1}^{n} (x_{i} - c)$, so the minimizer of $f$ satisfies $f'(c) = 0$, which is equivalent to

$$ \sum_{i=1}^{n} (x_i - c) = 0 $$

therefore the minimizer is

$$ c^{\star} = \frac{1}{n} \sum_{i=1}^{n} x_{i}, $$

the arithmetic mean of the $x_{i}$'s. I'll leave it to you to check that this is a minimum and not a maximum.
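If calculus isn't your thing, a quick numerical check makes the same point. Here is a small Python sketch (my own illustration, not part of the quoted book or the original argument; the sample and grid are arbitrary) that evaluates $f$ on a grid of candidate values $c$ and confirms the minimum sits at the sample mean:

```python
# Numerical sanity check: for a random sample, f(c) evaluated at the sample
# mean should be no larger than f(c) at any other candidate value c.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=50)  # arbitrary sample

def f(c, x):
    """Mean squared deviation of the sample x from the point c."""
    return np.mean((x - c) ** 2)

x_bar = x.mean()
candidates = np.linspace(x_bar - 3, x_bar + 3, 601)
values = [f(c, x) for c in candidates]

# The grid minimizer should coincide (up to grid resolution) with the sample mean.
print("sample mean:      ", x_bar)
print("grid minimizer:   ", candidates[int(np.argmin(values))])
print("f(sample mean):   ", f(x_bar, x))
print("min over the grid:", min(values))
```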

In the case of data $X_{1}, \ldots, X_{n}$ with sample mean $\overline{X}$, $f(\overline{X})$ is exactly the (uncorrected, divide-by-$n$) sample variance. Given what we've said above,

$$ f(\overline{X}) \leq f(z) $$

for any other $z$, including $z = \mu$, the population mean. This is why the sample variance is never larger than the mean squared difference from the population mean, and it is strictly smaller except when $\overline{X} = \mu$, which occurs with probability zero when the $X_i$'s come from a continuous distribution.
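To connect this back to the $n - 1$ business in the quoted passage, here is a short simulation (again my own sketch, not from the original answer; the normal distribution, sample size, and seed are arbitrary choices). It shows that $f(\overline{X}) \leq f(\mu)$ in every sample, that the average of $f(\overline{X})$ falls short of the true variance by the factor $(n-1)/n$, and that dividing by $n - 1$ instead removes the bias:

```python
# Monte Carlo illustration: repeatedly draw samples from a known population
# and compare
#   f(mu)    = mean squared deviation from the true mean,
#   f(X_bar) = mean squared deviation from the sample mean (divide-by-n variance),
#   s^2      = the n-1 corrected sample variance.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, reps = 0.0, 3.0, 10, 200_000

samples = rng.normal(loc=mu, scale=sigma, size=(reps, n))
x_bar = samples.mean(axis=1, keepdims=True)

f_mu = np.mean((samples - mu) ** 2, axis=1)       # uses the true mean
f_xbar = np.mean((samples - x_bar) ** 2, axis=1)  # uses the sample mean, divide by n
s2 = samples.var(axis=1, ddof=1)                  # n - 1 in the denominator

print("true variance sigma^2:            ", sigma ** 2)     # 9.0
print("average of f(mu):                 ", f_mu.mean())    # approx. 9.0
print("average of f(X_bar), divide by n: ", f_xbar.mean())  # approx. 8.1 = (n-1)/n * 9
print("average with n-1 correction:      ", s2.mean())      # approx. 9.0
print("f(X_bar) <= f(mu) in every sample:", bool(np.all(f_xbar <= f_mu)))
```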
