[Math] Why is there not a simpler way to calculate the standard deviation

standard deviation, statistics

Steps for calculating the standard deviation, from http://www.techbookreport.com/tutorials/stddev-30-secs.html:

  1. Work out the average (mean value) of your set of numbers

  2. Work out the difference between each number and the mean

  3. Square the differences

  4. Add up the square of all the differences

  5. Divide this by one less than the number of numbers in your set –
    this is called the variance

  6. Take the square root of the variance and you've got the standard
    deviation
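
As a minimal sketch, the six steps above can be written out in Python. The function name `sample_std_dev` and the data set are made up for illustration; the result is compared against the standard library's `statistics.stdev`, which also divides by one less than the number of values.

```python
import math
import statistics


def sample_std_dev(values):
    """Sample standard deviation, following the six steps literally."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n                  # step 1: the average
    diffs = [x - mean for x in values]      # step 2: differences from the mean
    squares = [d * d for d in diffs]        # step 3: square the differences
    total = sum(squares)                    # step 4: add up the squares
    variance = total / (n - 1)              # step 5: divide by n - 1
    return math.sqrt(variance)              # step 6: square root


data = [2, 4, 4, 4, 5, 5, 7, 9]             # illustrative data, not from the tutorial
print(sample_std_dev(data))
print(statistics.stdev(data))               # library version for comparison
```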

Am I missing something? Why do we need to square the differences in step 3?

Why not simply take the absolute value (multiply all negative numbers by -1) in step 3?

My second question: why do we need to divide by one less than the number of numbers in the set in step 5? Why not simply divide by the number of numbers?

Best Answer

There is. Your alternative formulation of taking the absolute values of the differences instead of squaring them is called the mean absolute deviation (or average absolute deviation).
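
A minimal sketch of that alternative, assuming the common convention of measuring deviations from the mean and dividing by the number of values $n$ (other conventions exist, such as deviations from the median). The function name and data are illustrative.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    n = len(values)
    mean = sum(values) / n
    # Take absolute values of the differences instead of squaring them.
    return sum(abs(x - mean) for x in values) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative data
print(mean_absolute_deviation(data))
```

No square root is needed at the end, because the absolute differences are already in the same units as the data.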

Both the mean absolute deviation and the standard deviation are used in practice, but much of the reason the standard deviation is more widely used is that it has nicer theoretical properties. For example, the mean and standard deviation are enough to specify which member of the family of normal distributions you are dealing with (edit: although this is convention, as Robert Israel notes in his comment below), and data values $x$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$ can be transformed to data values $z$ from the standard normal distribution via $z = (x - \mu)/\sigma$. Another advantage of the standard deviation, as Robert Israel notes below, is that there is a simple formula for the standard deviation of the sum of independent random variables. (See also the paper referenced below for more on why we use the standard deviation, as well as some arguments in favor of the mean absolute deviation.)
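
To illustrate the formula for sums: if $X$ and $Y$ are independent, their variances add, so $\sigma_{X+Y} = \sqrt{\sigma_X^2 + \sigma_Y^2}$. The sketch below checks this exactly for two fair dice (an example chosen here, not taken from the question) by enumerating all 36 outcomes, and shows that the mean absolute deviation of the sum is not obtained by simply adding or combining the individual values in the same way.

```python
from fractions import Fraction
from itertools import product

die = [Fraction(k) for k in range(1, 7)]   # one fair six-sided die


def mean(outcomes):
    return sum(outcomes) / len(outcomes)


def variance(outcomes):
    """Variance of a list of equally likely outcomes."""
    m = mean(outcomes)
    return sum((x - m) ** 2 for x in outcomes) / len(outcomes)


def mean_abs_dev(outcomes):
    """Mean absolute deviation of a list of equally likely outcomes."""
    m = mean(outcomes)
    return sum(abs(x - m) for x in outcomes) / len(outcomes)


# All 36 equally likely sums of two independent dice.
two_dice = [a + b for a, b in product(die, repeat=2)]

print(variance(die), variance(two_dice))          # variance doubles for the sum
print(mean_abs_dev(die), mean_abs_dev(two_dice))  # no such simple relation
```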

For an answer to your second question, see my answer to "Sample Standard Deviation vs. Population Standard Deviation." In short, if you were calculating the standard deviation of a population rather than a sample, you would divide by the population size $n$. However, when you calculate the standard deviation of a sample, you do not know the population mean that appears in the formula, so you have to estimate it with the sample mean. Doing so introduces a bias, as the data values tend to be slightly closer to the sample mean than to the population mean (the sample mean is itself calculated from the data values). It turns out that dividing by $n-1$ rather than $n$ corrects that bias in the variance estimate. (Proving that is a standard exercise in beginning statistical theory.)
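
The bias can be seen exactly on a toy example. The sketch below assumes a made-up population $\{1, 2, 3, 4\}$ and enumerates every ordered sample of size 2 drawn with replacement. Averaged over all samples, the divide-by-$n$ variance falls short of the population variance, while the divide-by-$(n-1)$ variance matches it.

```python
from fractions import Fraction
from itertools import product

population = [Fraction(v) for v in (1, 2, 3, 4)]   # hypothetical population


def variance(values, divisor):
    """Sum of squared deviations from the mean of `values`, over `divisor`."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / divisor


pop_var = variance(population, len(population))

# All 16 equally likely ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

avg_div_n = sum(variance(s, len(s)) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, len(s) - 1) for s in samples) / len(samples)

print(pop_var)             # population variance
print(avg_div_n)           # average of the divide-by-n estimates
print(avg_div_n_minus_1)   # average of the divide-by-(n-1) estimates
```

Note that the correction makes the variance estimate unbiased; the square root of that estimate is still, on average, slightly below the population standard deviation.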


Going back to your first question, I recently ran across the paper "Revisiting a 90-year-old debate: the advantages of the mean deviation," by Stephen Gorard. The paper is worth reading in full, but let me summarize some of his main points.

Reasons for the standard deviation:

  • It tends to have a smaller error, on average, when used to estimate a population standard deviation, and so is a more consistent estimate of the standard deviation of a population.
  • The mean absolute deviation is much more difficult to manipulate algebraically. This makes developing more sophisticated analyses based on it more difficult.
  • It's part of the definition of the widely-used normal distribution.
  • Historical: Ronald Fisher, one of the leading figures in the development of statistics, championed its use.

Reasons for the mean absolute deviation:

  • The standard deviation distorts the amount of dispersion (by the act of squaring the differences) in a data set.
  • The mean absolute deviation tends to work better in the presence of errors in our data observations.
  • The mean absolute deviation is less sensitive to outliers in the data (also because of the squaring in the standard deviation).
  • It's simpler to understand if all you want is a quick measure of dispersion.
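
A small sketch of the outlier point, using made-up data in which a single value is replaced by an extreme one. Both measures grow, but in this example the standard deviation grows by a larger factor than the mean absolute deviation, because squaring gives the extreme value extra weight. The helper name and the data are illustrative.

```python
import statistics


def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


clean = [2, 4, 4, 4, 5, 5, 7, 9]          # illustrative data
with_outlier = [2, 4, 4, 4, 5, 5, 7, 50]  # same data, last value replaced

# How many times larger each measure becomes once the outlier is present.
sd_ratio = statistics.stdev(with_outlier) / statistics.stdev(clean)
mad_ratio = mean_absolute_deviation(with_outlier) / mean_absolute_deviation(clean)

print(round(sd_ratio, 2), round(mad_ratio, 2))
```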