Statistics – Motivation Behind Standard Deviation

Tags: intuition, standard deviation, statistics

Let's take the numbers 0-10. Their mean is 5, and the individual deviations from 5 are
-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5
So the average (magnitude of) deviation from the mean is $30/11 \approx 2.73$.

However, this is not the standard deviation. The standard deviation is $\sqrt{10} \approx 3.16$.
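Both figures are easy to check numerically. Here is a minimal sketch in Python/NumPy (not part of the original question):

```python
import numpy as np

x = np.arange(11)              # the numbers 0-10
mean = x.mean()                # 5.0

mad = np.abs(x - mean).mean()  # mean absolute deviation: 30/11
sd = x.std()                   # population standard deviation (ddof=0): sqrt(10)

print(mad)  # 2.7272...
print(sd)   # 3.1622...
```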

The first, mean-deviation definition is simpler and by far the more intuitive of the two, so I'm sure it's the definition statisticians worked with first. However, for some reason they adopted the second definition instead. What was the reasoning behind that decision?

Best Answer

Your guess is correct: least absolute deviations was the method tried first historically. The first to use it were astronomers attempting to combine observations subject to error. Boscovich published this method and a geometric solution in 1755. It was used later by Laplace in a 1789 work on geodesy; Laplace formulated the problem more mathematically and described an analytical solution.

Legendre appears to be the first to use least squares, doing so as early as 1798 for work in celestial mechanics. However, he supplied no probabilistic justification. A decade later, Gauss (in an 1809 treatise on celestial motion and conic sections) asserted axiomatically that the arithmetic mean was the best way to combine observations, invoked the maximum likelihood principle, and then showed that a probability distribution for which the likelihood is maximized at the mean must be proportional to $\exp(-x^2 / (2 \sigma^2))$ (now called a "Gaussian") where $\sigma$ quantifies the precision of the observations.
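Gauss's argument was analytical, but the key claim (that the Gaussian likelihood of a set of observations is maximized by taking $\mu$ to be their arithmetic mean) is easy to illustrate numerically. A small sketch follows; the observations and $\sigma$ are invented for illustration:

```python
import numpy as np

# Hypothetical measurements of a single quantity, subject to error (made-up data)
obs = np.array([9.8, 10.3, 10.1, 9.6, 10.2])
sigma = 0.3  # assumed, known precision of the observations

def neg_log_likelihood(mu):
    # Negative log of prod_i exp(-(x_i - mu)^2 / (2 sigma^2)),
    # dropping additive terms that do not depend on mu
    return np.sum((obs - mu) ** 2) / (2 * sigma ** 2)

grid = np.linspace(9.0, 11.0, 2001)   # candidate values for the "true" value mu
best = grid[np.argmin([neg_log_likelihood(m) for m in grid])]

print(best, obs.mean())  # both 10.0: the likelihood is maximized at the mean
```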

The likelihood (when the observations are statistically independent) is the product of these Gaussian terms, which, due to the presence of the exponential, is most easily maximized by minimizing the negative of its logarithm. Up to an additive constant, the negative log of the product is the sum of the squares (all divided by a constant $2 \sigma^2$, which will not affect the minimization). Thus, even historically, the method of least squares is intimately tied up with likelihood calculations and averaging. There are plenty of other modern justifications for least squares, of course, but this derivation by Gauss--with the almost magical appearance of the Gaussian, which had first appeared some 70 years earlier in de Moivre's work on sums of Bernoulli variables (the Central Limit Theorem)--is memorable.
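In symbols (my notation, not the answer's): for independent observations $x_1, \dots, x_n$ of a quantity with unknown true value $\mu$,

$$L(\mu) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right), \qquad -\log L(\mu) = n \log\!\left(\sigma\sqrt{2\pi}\right) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2,$$

so maximizing $L$ over $\mu$ is exactly the problem of minimizing the sum of squared deviations $\sum_i (x_i - \mu)^2$.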

This story was researched, and is ably recounted, by Stephen Stigler in his The History of Statistics: The Measurement of Uncertainty before 1900 (1986). Here I have merely given the highlights of parts of chapters 1 and 4.