[Math] Denominator to calculate standard deviation

I've just read something about calculating the sample standard deviation. Some explanations say the $n-1$ is there because of degrees of freedom: with the constraint that the deviations from the sample mean must sum to zero, we actually have only $n-1$ degrees of freedom. Well, this explanation is pretty convincing.

But there is still another question stuck in my mind. If the degrees of freedom of the sample data are $n-1$ due to the constraint, then the same constraint also applies to the full data set. So it seems to me that when calculating the standard deviation of the full data set, we should also use $n-1$ as the denominator. Can anyone explain to me why we don't use $n-1$ for the full data set?

Please give warm help to a math beginner 🙂

Best Answer

The "degrees of freedom" explanation of using $n-1$ for the sample standard deviation is close to hand-waving.

The use of $n$ in calculating the population variance, and hence the population standard deviation, comes from the definition of variance for a finite set of equally probable outcomes. It is consistent with the definition for discrete distributions whose points have different probabilities (so there is no $n$) and with continuous distributions, which have densities rather than probabilities.
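
Concretely, for $n$ equally probable values $x_1,\dots,x_n$ with mean $\mu$, and in the two more general cases just mentioned:

$$\sigma^2=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2, \qquad \sigma^2=\sum_i p_i\,(x_i-\mu)^2, \qquad \sigma^2=\int (x-\mu)^2 f(x)\,dx.$$

The first is the special case of the second with every $p_i=1/n$.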

Take for example the set of equally probable values $(1,3,3,9)$. This has mean $4$, variance $9$ and standard deviation $3$. So too does the set of equally probable values $(1,1,3,3,3,3,9,9)$. And so does the distribution which is $1$ with probability $\frac{1}{4}$, $3$ with probability $\frac{1}{2}$, and $9$ with probability $\frac{1}{4}$. This consistency is helpful.
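
For the first set this is just arithmetic:

$$\mu=\frac{1+3+3+9}{4}=4, \qquad \sigma^2=\frac{(1-4)^2+(3-4)^2+(3-4)^2+(9-4)^2}{4}=\frac{9+1+1+25}{4}=9,$$

so $\sigma=\sqrt{9}=3$.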

So why use $n-1$ as the denominator for sample statistics? The reason is bias. Suppose we take a sample (with replacement) of size $n$ from any of these three distributions. Taking the sum of the sample values and dividing by $n$ (the sample mean) gives us an estimate of the population mean, and while the sample mean will often not be 4, its expected value is 4; so it is an unbiased estimator.
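
This is quick to check by linearity of expectation: if each $X_i$ has mean $\mu$, then

$$E[\bar X]=E\!\left[\frac{1}{n}\sum_{i=1}^{n}X_i\right]=\frac{1}{n}\sum_{i=1}^{n}E[X_i]=\mu.$$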

Trying the same approach to estimate the population variance, by taking the sum of squares of the differences between the sample values and the sample mean and then dividing by $n$, gives us something with expected value $9(n-1)/n$, which is slightly less than $9$; so it is a biased estimator of the population variance. It becomes unbiased if multiplied by $n/(n-1)$, which is equivalent to using $n-1$ in the denominator. So if an unbiased estimator of the variance is important to you, then this is what you do.
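
Here is where the $9(n-1)/n$ comes from. For an i.i.d. sample with population mean $\mu$ and variance $\sigma^2$, expanding around $\mu$ gives the identity

$$\sum_{i=1}^{n}(X_i-\bar X)^2=\sum_{i=1}^{n}(X_i-\mu)^2-n(\bar X-\mu)^2.$$

Taking expectations, the first term on the right is $n\sigma^2$ and the second is $n\operatorname{Var}(\bar X)=\sigma^2$, so

$$E\!\left[\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)^2\right]=\frac{n\sigma^2-\sigma^2}{n}=\frac{n-1}{n}\,\sigma^2,$$

which equals $9(n-1)/n$ when $\sigma^2=9$.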

You may have other considerations, in which case you can choose a different estimator of the variance. It is important to note that even if your estimator of the variance is unbiased, its square root is typically not an unbiased estimator of the standard deviation: the square root is concave, so by Jensen's inequality the expected value of the root falls below the root of the expected value.
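
If you want to see all of this numerically, here is a minimal Python sketch (the sample size $n=5$ and the trial count are illustrative choices of mine, not part of the answer): it repeatedly samples with replacement from the $(1,3,3,9)$ population and averages the two variance estimators, plus the square root of the unbiased one.

```python
import random

# Population from the example above: true variance 9, true standard deviation 3.
population = [1, 3, 3, 9]
n = 5            # sample size (illustrative choice)
trials = 200_000

sum_var_n = sum_var_n1 = sum_sd_n1 = 0.0
for _ in range(trials):
    sample = [random.choice(population) for _ in range(n)]   # sample with replacement
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)                # sum of squared deviations
    sum_var_n  += ss / n        # divide by n: biased estimator
    sum_var_n1 += ss / (n - 1)  # divide by n-1: unbiased estimator
    sum_sd_n1  += (ss / (n - 1)) ** 0.5

print("average of (divide by n)   :", sum_var_n / trials)   # about 9*(n-1)/n = 7.2
print("average of (divide by n-1) :", sum_var_n1 / trials)  # about 9
print("average of sqrt(unbiased)  :", sum_sd_n1 / trials)   # below 3: the root is still biased
```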