There are, in fact, two different formulas for standard deviation here: The population standard deviation $\sigma$ and the sample standard deviation $s$.
If $x_1, x_2, \ldots, x_N$ denote all $N$ values from a population, then the (population) standard deviation is
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2},$$
where $\mu$ is the mean of the population.
If $x_1, x_2, \ldots, x_N$ denote $N$ values from a sample, however, then the (sample) standard deviation is
$$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \bar{x})^2},$$
where $\bar{x}$ is the mean of the sample.
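To make the two definitions concrete, here is a short sketch in Python (the data values are arbitrary; NumPy's `ddof` argument is what selects the divisor):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mean = x.mean()                                            # 5.0

pop_sd = np.sqrt(np.sum((x - mean) ** 2) / len(x))         # divide by N
samp_sd = np.sqrt(np.sum((x - mean) ** 2) / (len(x) - 1))  # divide by N - 1

print(pop_sd, np.std(x))            # 2.0   -- np.std divides by N by default (ddof=0)
print(samp_sd, np.std(x, ddof=1))   # about 2.138 -- ddof=1 gives the sample formula
```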
The reason for the change in formula with the sample is this: When you calculate $s$ you are normally using $s^2$ (the sample variance) to estimate $\sigma^2$ (the population variance). The problem, though, is that if you don't know $\sigma$ you generally don't know the population mean $\mu$ either, and so you have to use $\bar{x}$ in the place where $\mu$ would normally go in the formula. Doing so introduces a slight bias into the calculation: Since $\bar{x}$ is calculated from the sample itself, the values $x_i$ are on average closer to $\bar{x}$ than they are to $\mu$ (in fact, $\bar{x}$ is exactly the value that minimizes the sum of squared deviations), and so the sum of squares $\sum_{i=1}^N (x_i - \bar{x})^2$ turns out to be smaller (never larger) than $\sum_{i=1}^N (x_i - \mu)^2$. It just so happens that this bias is corrected exactly by dividing by $N-1$ instead of $N$. (Proving this is a standard exercise in an advanced undergraduate or beginning graduate course in statistical theory.) The technical term here is that $s^2$ (because of the division by $N-1$) is an unbiased estimator of $\sigma^2$.
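If you would rather see the bias numerically than take the theory on faith, here is a small simulation sketch (Python with NumPy; the population parameters, sample size, and number of trials are arbitrary choices, not part of the argument above):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0           # population mean and standard deviation (arbitrary)
n, trials = 10, 200_000        # small samples, repeated many times

biased, unbiased = [], []
for _ in range(trials):
    x = rng.normal(mu, sigma, size=n)
    ss = np.sum((x - x.mean()) ** 2)   # sum of squared deviations from the sample mean
    biased.append(ss / n)              # divide by N
    unbiased.append(ss / (n - 1))      # divide by N - 1

print("true variance:    ", sigma ** 2)         # 4.0
print("mean of ss/N:     ", np.mean(biased))    # noticeably below 4.0 (about 3.6)
print("mean of ss/(N-1): ", np.mean(unbiased))  # close to 4.0
```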
Another way to think about it is that with a sample you have $N$ independent pieces of information. However, since $\bar{x}$ is the average of those $N$ pieces, the residuals $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_N - \bar{x}$ always sum to zero, so if you know the first $N-1$ of them you can figure out what $x_N - \bar{x}$ is. When you square and add up the residuals $x_i - \bar{x}$, then, there are really only $N-1$ independent pieces of information there, and in that sense dividing by $N-1$ rather than $N$ perhaps makes sense. The technical term here is that there are $N-1$ degrees of freedom in the residuals $x_i - \bar{x}$.
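To make that concrete, here is a tiny sketch (the sample values are arbitrary) showing that the residuals sum to zero, so the last one is determined by the other $N-1$:

```python
import numpy as np

x = np.array([2.0, 5.0, 7.0, 10.0])    # any sample; these particular values are arbitrary
resid = x - x.mean()                    # residuals x_i - xbar

print(resid)              # [-4. -1.  1.  4.]
print(resid.sum())        # 0.0 (up to rounding): the residuals always sum to zero
print(-resid[:-1].sum())  # 4.0, i.e. the last residual is fixed by the first N-1
```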
For more information, see Wikipedia's article on the sample standard deviation.
There is. Your alternative formulation of taking the absolute values of the differences instead of squaring them is called the mean absolute deviation (or average absolute deviation).
Both the mean absolute deviation and the standard deviation are used in practice, but much of the reason the standard deviation is more widely used is that it has nicer theoretical properties. For example, the mean and standard deviation are enough to specify which member of the family of normal distributions you are dealing with (edit: although, as Robert Israel notes in his comment below, this parameterization is a matter of convention), and data values $x$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$ can be transformed to data values $z$ from the standard normal distribution via $z = (x - \mu)/\sigma$. Another advantage of the standard deviation, as Robert Israel notes below, is that there is a simple formula for the standard deviation of the sum of independent random variables. (See also the paper referenced below for more on why we use the standard deviation, as well as some arguments in favor of the mean absolute deviation.)
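Here is a brief numerical illustration of both of those properties (Python with NumPy; the particular means and standard deviations are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Standardizing: data from N(mu, sigma) mapped to the standard normal via z = (x - mu) / sigma
mu, sigma = 10.0, 3.0                      # arbitrary population parameters
x = rng.normal(mu, sigma, size=100_000)
z = (x - mu) / sigma
print(z.mean(), z.std())                   # approximately 0 and 1

# Sum of independent variables: sd(X + Y) = sqrt(sd(X)^2 + sd(Y)^2)
y = rng.normal(0.0, 4.0, size=100_000)     # independent of x, standard deviation 4
print(np.std(x + y), np.hypot(3.0, 4.0))   # both approximately 5
```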
For an answer to your second question, see my answer to "Sample Standard Deviation vs. Population Standard Deviation." In short, if you were calculating the standard deviation of a population rather than a sample, you would divide by the population size $n$. When you calculate the standard deviation of a sample, however, you have to replace the population mean that would normally appear in the formula with the sample mean. Doing so introduces a bias, because the data values tend to be slightly closer to the sample mean than to the population mean (the sample mean is, after all, calculated from those same data values). It turns out that dividing by $n-1$ rather than $n$ corrects that bias. (Proving that is a standard exercise in beginning statistical theory.)
Going back to your first question, I recently ran across the paper "Revisiting a 90-year-old debate: the advantages of the mean deviation," by Stephen Gorard. The paper is worth reading in full, but let me summarize some of his main points.
Reasons for the standard deviation:
- When used to estimate the standard deviation of a population, it tends to have a smaller error on average, and so gives a more consistent estimate of the population's dispersion.
- The mean absolute deviation is much more difficult to manipulate algebraically. This makes developing more sophisticated analyses based on it more difficult.
- It's part of the definition of the widely-used normal distribution.
- Historical: Ronald Fisher, one of the leading figures in the development of statistics, championed its use.
Reasons for the mean absolute deviation:
- By squaring the differences, the standard deviation distorts the picture of how much dispersion there is in a data set.
- The mean absolute deviation tends to work better in the presence of errors in our data observations.
- The mean absolute deviation is less sensitive to outliers in the data (also because of the squaring in the standard deviation); the sketch after this list illustrates the difference.
- It's simpler to understand if all you want is a quick measure of dispersion.
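Here is a small sketch of that outlier effect (Python with NumPy; the data values and the size of the outlier are arbitrary choices):

```python
import numpy as np

def mean_abs_dev(a):
    """Mean absolute deviation about the mean."""
    a = np.asarray(a, dtype=float)
    return np.mean(np.abs(a - a.mean()))

clean = np.array([4.0, 5.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0])
dirty = np.append(clean, 50.0)               # the same data plus one large outlier

print(np.std(clean), mean_abs_dev(clean))    # about 0.71 and 0.5: both small and similar
print(np.std(dirty), mean_abs_dev(dirty))    # about 14.2 and 8.9: the standard deviation
                                             # is pulled up more sharply by the outlier
```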
Best Answer
If you have $n$ samples, a natural estimate of the variance is $$s_n^2=\frac{\sum_{i=1}^n (X_i-m)^2}{n},$$ where $m$ is the sample mean $\frac{1}{n}\sum_{i=1}^n X_i$. For an estimator to be unbiased you need $$E(s_n^2)=\sigma^2,$$ where $\sigma^2$ is the real, unknown value of the variance. But it is possible to show that $$E(s_n^2)=E\left(\frac{\sum_{i=1}^n (X_i-m)^2}{n} \right)=\frac{n-1}{n}\,\sigma^2 < \sigma^2,$$ so this estimator is biased low. If you want an unbiased estimate of the 'real' value of $\sigma^2$, you must therefore divide by $n-1$ instead of $n$.
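For completeness, here is a sketch of the standard computation behind that expectation, assuming the $X_i$ are independent with common mean $\mu$ and variance $\sigma^2$. Since $\sum_{i=1}^n (X_i - m) = 0$, expanding $(X_i - \mu)^2 = \big((X_i - m) + (m - \mu)\big)^2$ and summing gives
$$\sum_{i=1}^n (X_i - m)^2 = \sum_{i=1}^n (X_i - \mu)^2 - n(m - \mu)^2.$$
Taking expectations, $E\left[\sum_{i=1}^n (X_i-\mu)^2\right] = n\sigma^2$ and $E\left[n(m-\mu)^2\right] = n \cdot \operatorname{Var}(m) = n \cdot \frac{\sigma^2}{n} = \sigma^2$, so
$$E\left[\sum_{i=1}^n (X_i - m)^2\right] = (n-1)\sigma^2, \qquad E\left[\frac{1}{n-1}\sum_{i=1}^n (X_i - m)^2\right] = \sigma^2.$$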