Statistics – Why Variance is the Squared Difference

Tags: standard deviation, statistics

So there is this question about why variance is squared.

And the answer seems to be "because we get to do groovy maths when it is squared". Ok, that's cool, I can dig.

However, I'm sitting here reading some financial maths material, and a lot of the equations on pricing and risk are based on variance. For pricing exotic vehicles worth millions (or billions), "because the maths is better" doesn't seem like the best justification for building the formulas on variance.

To make the point, then: why not base the variance on the cube, the absolute cube, or the 4th power (or even a negative power)?

e.g. (apologies, I don't know LaTeX)

Sum (1/N) * |x - mean|^3

OR

Sum (1/N) * (x - mean)^4

Would using a variance-to-a-different-power measurably alter pricings/valuations if the equations still used variance as usual (but the variance were calculated with the different power)?

Is there a reason why we stopped at "power of 2", and are there any implications of using a variance concocted from a different (higher or lower) power?
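For concreteness, here's how each candidate would be computed (a quick Python sketch with made-up numbers; the helper name is my own invention):

```python
import statistics

def deviation_measure(xs, k, signed=False):
    """Mean of (x - mean)^k (signed) or |x - mean|^k.
    k=1 unsigned is the MAD; k=2 is the variance; k=3 and k=4 are the
    hypothetical higher-power measures from the question."""
    m = statistics.fmean(xs)
    if signed:
        return sum((x - m) ** k for x in xs) / len(xs)
    return sum(abs(x - m) ** k for x in xs) / len(xs)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample, mean 5.0
for k in (1, 2, 3, 4):
    print(f"k={k}: {deviation_measure(data, k)}")

# The signed cube is a poor spread measure: positive and negative
# deviations partly cancel, so the result understates the spread.
print(deviation_measure(data, 3, signed=True))
```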

Best Answer

There is a discussion on Khan Academy on the same topic, which I found helpful. The reasons that the standard deviation is preferred to the mean absolute deviation are complicated. To start, let me address your list: yes, we can use other powers for the deviations, but not just any power. Using the absolute values is not uncommon and gives the Mean Absolute Deviation (MAD). Using squared deviations gives the Variance (and, by square-rooting, the standard deviation).

We don't use, say, the power of 3, because then positive and negative deviations would cancel each other out, which we don't want to do. We could use higher even powers (or define the deviations as a power of the absolute value), but we really don't want to do this. Why? For a few reasons.

  1. The effect of outliers.
  2. The concept of central tendency.
  3. The cleanliness of the math.
  4. Interpretability and Harmony with other concepts.

To explain each of these:

  1. Using squared deviations already places higher weight on large deviations than using the absolute value does. This means that large deviations "pull" results towards themselves, and do so more strongly than the main mass of the data. Many statisticians already think squared deviations give too much weight to large deviations; if we used a higher power, we would be giving them even more.
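A quick sketch of that effect (toy data with one outlier; the numbers are made up): the share of the total measure contributed by the single outlier grows rapidly with the power k.

```python
import statistics

data = [1.0, 2.0, 2.0, 3.0, 12.0]  # 12.0 is the outlier
m = statistics.fmean(data)

for k in (1, 2, 4):
    contribs = [abs(x - m) ** k for x in data]
    share = contribs[-1] / sum(contribs)
    print(f"k={k}: the outlier contributes {share:.0%} of the total")
```

On this data the outlier's share climbs from half the total at k=1 to nearly all of it at k=4.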

  2. In Statistics, we like to have "small" variability. As a result of this, we define our measure of central tendency to be a function of how the deviations are measured. That may be gibberish to you, so I'll clarify. We express our variability as:

(1/n) Σ |xi - θ|^k

When k=1 we have absolute deviations, and get the MAD. When k=2, we have squared deviations, and get the Variance. Then the question is: For a given value of k, what is the best value of θ? In other words: what value of θ is going to give us the smallest measure of variability? It turns out that when k=2, we get θ to be the sample mean, xbar. But this isn't always the case. For instance, when k=1, we get θ to be the sample median, not the sample mean (side note: this means that Sal's formula in "Part 5" is wrong, he should be subtracting the median, not the mean).

Getting θ = xbar is a "nice" result. The sample mean has been known for a long time (since the Ancient Greeks, at least), and when people were developing the idea of variability as a quantity we can calculate, they wanted the "best" measure of center to be the sample mean. They tried using k=1, but since that gave the median, they turned to k=2.
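That claim is easy to check numerically. A brute-force sketch (grid search over candidate values of θ; the data and helper names are mine):

```python
import statistics

def variability(xs, theta, k):
    # (1/n) * sum of |x - theta|^k
    return sum(abs(x - theta) ** k for x in xs) / len(xs)

def best_theta(xs, k, lo=0.0, hi=10.0, steps=10001):
    # brute-force grid search for the theta minimizing the variability
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda t: variability(xs, t, k))

data = [1.0, 2.0, 3.0, 4.0, 10.0]  # mean 4.0, median 3.0
print(best_theta(data, k=1))  # lands on the median
print(best_theta(data, k=2))  # lands on the mean
```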

  3. The math is much cleaner when we use k=2. It's fairly simple to prove that when k=2, θ is the sample mean. It's messier to show that for k=1, θ is the sample median; I had to do it once, in my PhD coursework. Out of curiosity, I tried using k=4 to figure out what θ should be, but abandoned it once I expanded (xi - θ)^3, because it seemed too messy to pursue for no real reason. This messiness compounds itself when we try to move beyond a simple measure of center and into more complex models.
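For what it's worth, even though the k=4 algebra is messy, the k=4 center can still be found numerically: the objective Σ(xi - θ)^4 is convex in θ, so bisection on its derivative works (a sketch with made-up data):

```python
def d4(xs, theta):
    # derivative of sum((x - theta)**4) with respect to theta
    return -4 * sum((x - theta) ** 3 for x in xs)

def k4_center(xs, lo=-100.0, hi=100.0, iters=200):
    # the objective is convex, so its derivative increases in theta:
    # bisect for the root
    for _ in range(iters):
        mid = (lo + hi) / 2
        if d4(xs, mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

data = [1.0, 2.0, 3.0, 4.0, 10.0]
print(k4_center(data))  # neither the mean (4.0) nor the median (3.0)
```

On this data the k=4 center sits above the mean, which again shows how a higher power lets the outlier pull the center toward itself.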

  4. Related to point 2, we like the sample mean. In particular, there is a very nice theorem stating that as the sample size increases, the sampling distribution of the sample mean converges towards the Normal distribution (you may not have reached probability distributions yet, but this will come). Knowing that a particular quantity will have an approximately Normal distribution under pretty mild conditions is extremely useful. Hence, we "want" to be able to use the sample mean, because it means we can build a lot of theory and a lot of methods on the Normal distribution. Using the sample mean goes hand in hand with squaring deviations (k=2).
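That theorem (the Central Limit Theorem) shows up in a quick simulation. Here we draw from a heavily skewed exponential distribution whose true mean is 1.0 (a sketch; the sample sizes are arbitrary):

```python
import random
import statistics

random.seed(0)

def sample_mean(n):
    # mean of n draws from a skewed (exponential) distribution
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

means = [sample_mean(50) for _ in range(5000)]

# The sampling distribution of the mean clusters tightly and roughly
# symmetrically around the true mean, 1.0, even though each individual
# draw is heavily skewed.
print(statistics.fmean(means), statistics.stdev(means))
```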

Also, the variance is a special case of the "covariance", which describes how two variables vary together (the variance is the case where the two variables are the exact same variable). So using squared deviations is a natural choice that fits well with other concepts. The idea of squaring deviations also arises naturally out of simple mathematics, irrespective of anything else. For instance, if we do linear regression and say that y = Xβ, where y is a vector, X is a matrix, and β is a vector of coefficients, then simple linear algebra (solving the normal equations) yields the same solution as minimizing the squared deviations (seeing this connection isn't quite as clear, but it is true).
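That regression connection can be seen in a small sketch (made-up data): the slope and intercept from the closed-form normal-equations formulas are exactly the pair that minimizes the sum of squared deviations.

```python
import statistics

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]  # made-up, roughly linear

# Slope and intercept from the normal equations (pure linear algebra)
xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        / sum((x - xbar) ** 2 for x in xs))
alpha = ybar - beta * xbar

def sse(a, b):
    # sum of squared deviations of the data from the line y = a + b*x
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Nudging either coefficient in any direction only increases the error
assert sse(alpha, beta) <= min(
    sse(alpha + 0.01, beta), sse(alpha - 0.01, beta),
    sse(alpha, beta + 0.01), sse(alpha, beta - 0.01))
print(alpha, beta)
```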
