[Math] Denominator to calculate standard deviation

I've just read something about calculating the sample standard deviation. Some explanations say the $n-1$ is there because of degrees of freedom: with the constraint that the deviations from the sample mean must sum to zero, we actually have only $n-1$ degrees of freedom. Well, this explanation is pretty convincing.

But there is still another question stuck in my mind. If the degrees of freedom of the sample data are $n-1$ due to the constraint, then the same constraint also applies to the full data set. So it seems to me that when calculating the standard deviation of the full data set, we should also use $n-1$ as the denominator. Can anyone explain to me why we don't use $n-1$ for the full data set?

Please give warm help to a math beginner 🙂

Best Answer

The "degrees of freedom" explanation of using $n-1$ for the sample standard deviation is close to hand-waving.

The use of $n$ in calculating the population variance, and hence the population standard deviation, comes from the definition of variance for a finite set of equally probable outcomes. It is consistent with the definition for discrete distributions whose points have different probabilities (so there is no $n$) and with continuous distributions, which have densities rather than probabilities.
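
Concretely, for $n$ equally probable values $x_1,\dots,x_n$ with mean $\mu$, and in the two more general cases just mentioned:

$$\sigma^2=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2, \qquad \sigma^2=\sum_i p_i\,(x_i-\mu)^2, \qquad \sigma^2=\int (x-\mu)^2 f(x)\,dx.$$

The first is the special case of the second with every $p_i=1/n$.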

Take for example the set of equally probable values $(1,3,3,9)$. This has mean $4$, variance $9$ and standard deviation $3$. So too does the set of equally probable values $(1,1,3,3,3,3,9,9)$. And so does the distribution which is $1$ with probability $\frac{1}{4}$, $3$ with probability $\frac{1}{2}$, and $9$ with probability $\frac{1}{4}$. This consistency is helpful.
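
For the first set this is just arithmetic:

$$\mu=\frac{1+3+3+9}{4}=4, \qquad \sigma^2=\frac{(1-4)^2+(3-4)^2+(3-4)^2+(9-4)^2}{4}=\frac{9+1+1+25}{4}=9,$$

so $\sigma=\sqrt{9}=3$.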

So why use $n-1$ as the denominator for sample statistics? The reason is bias. Suppose we take a sample (with replacement) of size $n$ from any of these three distributions. Taking the sum of the sample values and dividing by $n$ (the sample mean) gives us an estimate of the population mean, and while the sample mean will often not be 4, its expected value is 4; so it is an unbiased estimator.
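
This is quick to check by linearity of expectation: if each $X_i$ has mean $\mu$, then

$$E[\bar X]=E\!\left[\frac{1}{n}\sum_{i=1}^{n}X_i\right]=\frac{1}{n}\sum_{i=1}^{n}E[X_i]=\mu.$$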

Trying the same approach to estimate the population variance, by taking the sum of squares of the differences between the sample values and the sample mean and then dividing by $n$, gives us something with expected value $9(n-1)/n$, which is slightly less than $9$; so it is a biased estimator of the population variance. It becomes unbiased if multiplied by $n/(n-1)$, which is equivalent to using $n-1$ in the denominator. So if an unbiased estimator of the variance is important to you, then this is what you do.
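
Here is where the $9(n-1)/n$ comes from. For an i.i.d. sample with population mean $\mu$ and variance $\sigma^2$, expanding around $\mu$ gives the identity

$$\sum_{i=1}^{n}(X_i-\bar X)^2=\sum_{i=1}^{n}(X_i-\mu)^2-n(\bar X-\mu)^2.$$

Taking expectations, the first term on the right is $n\sigma^2$ and the second is $n\operatorname{Var}(\bar X)=\sigma^2$, so

$$E\!\left[\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)^2\right]=\frac{n\sigma^2-\sigma^2}{n}=\frac{n-1}{n}\,\sigma^2,$$

which equals $9(n-1)/n$ when $\sigma^2=9$.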

You may have other considerations, in which case you can choose a different estimator of the variance. It is important to note that even if your estimator of the variance is unbiased, its square root is typically not an unbiased estimator of the standard deviation: the square root is concave, so by Jensen's inequality the expected value of the root falls below the root of the expected value.
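
If you want to see all of this numerically, here is a minimal Python sketch (the sample size $n=5$ and the trial count are illustrative choices of mine, not part of the answer): it repeatedly samples with replacement from the $(1,3,3,9)$ population and averages the two variance estimators, plus the square root of the unbiased one.

```python
import random

# Population from the example above: true variance 9, true standard deviation 3.
population = [1, 3, 3, 9]
n = 5            # sample size (illustrative choice)
trials = 200_000

sum_var_n = sum_var_n1 = sum_sd_n1 = 0.0
for _ in range(trials):
    sample = [random.choice(population) for _ in range(n)]   # sample with replacement
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)                # sum of squared deviations
    sum_var_n  += ss / n        # divide by n: biased estimator
    sum_var_n1 += ss / (n - 1)  # divide by n-1: unbiased estimator
    sum_sd_n1  += (ss / (n - 1)) ** 0.5

print("average of (divide by n)   :", sum_var_n / trials)   # about 9*(n-1)/n = 7.2
print("average of (divide by n-1) :", sum_var_n1 / trials)  # about 9
print("average of sqrt(unbiased)  :", sum_sd_n1 / trials)   # below 3: the root is still biased
```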