[Math] Why does sample standard deviation underestimate population standard deviation

statistics

Referring to the Wikipedia page "Unbiased estimation of standard deviation", it says that "it follows from Jensen's inequality that the square root of the sample variance is an underestimate".

I do know that, because the square root function is concave, Jensen's inequality says that the square root of the mean > mean of the square root (with equality only when the quantity being averaged is constant).

So, how do we conclude that the square root of the sample variance underestimates population standard deviation?

Since we know from Jensen's inequality that square root of the mean > mean of the square root, does "square root of sample variance" somehow relate to "mean of the square root" while "population standard deviation" somehow relates to "square root of the mean"?

Added after joriki's response:

Given joriki's response about using a single sample of data, I am now left wondering why $s=\sqrt{\frac{1}{N-1}\sum_{i=1}^N{(x_i-\overline{x})^2}}$ underestimates the population standard deviation. In order to use Jensen's inequality (mean of the square root < square root of the mean), I need to somehow relate the expression for $s$ to a "mean of the square root". I do see the square root sign in the expression for $s$, but where is the "mean" of this square-root quantity?

Best Answer

The mean is part of what it means for an estimator to be biased. You can't make the estimator unbiased by averaging over several estimates; to the contrary, you can show that it's biased by averaging over estimates and showing that the expected average isn't the value to be estimated. (You can reduce the bias and the variance of the estimator by averaging several estimates, but as discussed above you can do that even better by using all the data for one estimate.)
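Written out, the Jensen step looks roughly like this. This is a sketch that assumes the sample points are i.i.d. with finite variance $\sigma^2$; the symbol $\operatorname{E}$ for the expectation over repeated samples is notation added here, not taken from the question:

$$\operatorname{E}[s]=\operatorname{E}\left[\sqrt{s^2}\right]\le\sqrt{\operatorname{E}\left[s^2\right]}=\sqrt{\sigma^2}=\sigma.$$

Here the "mean of the square root" is $\operatorname{E}\left[\sqrt{s^2}\right]$, the average of $s$ over repeated samples, and the "square root of the mean" is $\sqrt{\operatorname{E}[s^2]}$, which equals $\sigma$ because $s^2$ is an unbiased estimator of $\sigma^2$. Since the square root is strictly concave, the inequality is strict unless $s^2$ takes the same value for every sample.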

For example, if your population has equidistributed values $-1,0,1$, with variance $\frac23$, and you take a sample of $2$ (drawn independently, with replacement), you'll get variance estimates of $0$, $\frac12$ and $2$ with probabilities $\frac13$, $\frac49$ and $\frac29$, respectively, yielding the correct mean

$$\frac13\cdot0+\frac49\cdot\frac12+\frac29\cdot2=\frac23,$$

whereas the estimates for the standard deviation, $0$, $\sqrt{\frac12}$ and $\sqrt2$, average to

$$\frac13\cdot0+\frac49\cdot\sqrt{\frac12}+\frac29\cdot\sqrt2=\frac49\sqrt2\approx0.6285\lt0.8165\approx\sqrt{\frac23},$$

an underestimate as expected. If you take a sample of $3$ instead, the mean improves to

$$\frac19\cdot0+\frac49\cdot\sqrt{\frac13}+\frac29\cdot\sqrt{\frac43}+\frac29\cdot1=\frac19\left(8\sqrt{\frac13}+2\right)\approx0.7354.$$
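The numbers in this example can be checked by brute force. The following is a minimal Python sketch, assuming samples are drawn independently with replacement, so every ordered tuple is equally likely; the names `POPULATION`, `sample_variance` and `expected_estimates` are made up for this illustration:

```python
from fractions import Fraction
from itertools import product
from math import sqrt

POPULATION = (-1, 0, 1)  # equally likely values; population variance is 2/3


def sample_variance(sample):
    """Sample variance with the N - 1 denominator, as an exact Fraction."""
    n = len(sample)
    mean = Fraction(sum(sample), n)
    return sum((x - mean) ** 2 for x in sample) / (n - 1)


def expected_estimates(n):
    """Average s^2 and s over all equally likely ordered samples of size n."""
    samples = list(product(POPULATION, repeat=n))
    variances = [sample_variance(sample) for sample in samples]
    mean_var = sum(variances, Fraction(0)) / len(samples)      # exact E[s^2]
    mean_sd = sum(sqrt(v) for v in variances) / len(samples)   # float E[s]
    return mean_var, mean_sd


if __name__ == "__main__":
    for n in (2, 3):
        mean_var, mean_sd = expected_estimates(n)
        print(f"n={n}: E[s^2]={mean_var}, E[s]={mean_sd:.4f}, "
              f"sigma={sqrt(2 / 3):.4f}")
```

For $n=2$ and $n=3$ the average of $s^2$ should come out as exactly $\frac23$, while the average of $s$ should match the values $\approx0.6285$ and $\approx0.7354$ above, both below $\sqrt{\frac23}\approx0.8165$.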
