Statistics – Why Variance is the Squared Difference

Tags: standard deviation, statistics

So there is this question about why variance is squared.

And the answer seems to be "because we get to do groovy maths when it is squared". Ok, that's cool, I can dig.

However, I'm sitting here reading some financial maths material, and a lot of the equations on pricing and risk are based on variance. For pricing exotic vehicles worth millions (or billions), "because the maths is better" doesn't seem like the best justification for building the formulas on variance.

To make the point, then: why not base the variance on the cube, the absolute cube, or the 4th power (or even a negative power)?

e.g. (apologies, I don't know LaTeX)

Sum (1/N) * |x - mean|^3

OR

Sum (1/N) * (x - mean)^4

Would using a variance-to-a-different-power measurably alter pricings/valuations if the equations still used variance as usual (but the variance were calculated with the different power)?

Is there a reason why we stopped at "power of 2", and are there any implications of using a variance concocted from a different (higher or lower) power?
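For concreteness, here's how each candidate would be computed (a quick Python sketch with made-up numbers; the helper name is my own invention):

```python
import statistics

def deviation_measure(xs, k, signed=False):
    """Mean of (x - mean)^k (signed) or |x - mean|^k.
    k=1 unsigned is the MAD; k=2 is the variance; k=3 and k=4 are the
    hypothetical higher-power measures from the question."""
    m = statistics.fmean(xs)
    if signed:
        return sum((x - m) ** k for x in xs) / len(xs)
    return sum(abs(x - m) ** k for x in xs) / len(xs)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample, mean 5.0
for k in (1, 2, 3, 4):
    print(f"k={k}: {deviation_measure(data, k)}")

# The signed cube is a poor spread measure: positive and negative
# deviations partly cancel, so the result understates the spread.
print(deviation_measure(data, 3, signed=True))
```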

Best Answer

There is a discussion on Khan Academy on the same topic, which I found helpful. The reasons that the standard deviation is preferred to the mean absolute deviation are complicated. To start, let me address your list: yes, we can use other powers for the deviations, but not just any power. Using the absolute values is not uncommon and gives the Mean Absolute Deviation (MAD). Using squared deviations gives the Variance (and, by square-rooting, the standard deviation).

We don't use, say, the power of 3, because then positive and negative deviations would cancel each other out, which we don't want to do. We could use higher even powers (or define the deviations as a power of the absolute value), but we really don't want to do this. Why? For a few reasons.

  1. The effect of outliers.
  2. The concept of central tendency.
  3. The cleanliness of the math.
  4. Interpretability and Harmony with other concepts.

To explain each of these:

  1. Using squared deviations already places higher weight on large deviations than using the absolute value does. This means that large deviations "pull" results towards themselves, and do so more strongly than the main mass of the data. Many statisticians already think squared deviations give too much weight to large deviations; if we used a higher power, we would be giving them even more.
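A quick sketch of that effect (toy data with one outlier; the numbers are made up): the share of the total measure contributed by the single outlier grows rapidly with the power k.

```python
import statistics

data = [1.0, 2.0, 2.0, 3.0, 12.0]  # 12.0 is the outlier
m = statistics.fmean(data)

for k in (1, 2, 4):
    contribs = [abs(x - m) ** k for x in data]
    share = contribs[-1] / sum(contribs)
    print(f"k={k}: the outlier contributes {share:.0%} of the total")
```

On this data the outlier's share climbs from half the total at k=1 to nearly all of it at k=4.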

  2. In Statistics, we like to have "small" variability. As a result of this, we define our measure of central tendency to be a function of how the deviations are measured. That may be gibberish to you, so I'll clarify. We express our variability as:

(1/n) Σ |xi - θ|^k

When k=1 we have absolute deviations, and get the MAD. When k=2, we have squared deviations, and get the Variance. Then the question is: For a given value of k, what is the best value of θ? In other words: what value of θ is going to give us the smallest measure of variability? It turns out that when k=2, we get θ to be the sample mean, xbar. But this isn't always the case. For instance, when k=1, we get θ to be the sample median, not the sample mean (side note: this means that Sal's formula in "Part 5" is wrong, he should be subtracting the median, not the mean).

Getting θ = xbar is a "nice" result. The sample mean has been known for a long time (since the Ancient Greeks, at least), and when people were developing the idea of variability as a quantity we can calculate, they wanted the "best" measure of center to be the sample mean. They tried using k=1, but since that gave the median, they turned to k=2.
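That claim is easy to check numerically. A brute-force sketch (grid search over candidate values of θ; the data and helper names are mine):

```python
import statistics

def variability(xs, theta, k):
    # (1/n) * sum of |x - theta|^k
    return sum(abs(x - theta) ** k for x in xs) / len(xs)

def best_theta(xs, k, lo=0.0, hi=10.0, steps=10001):
    # brute-force grid search for the theta minimizing the variability
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda t: variability(xs, t, k))

data = [1.0, 2.0, 3.0, 4.0, 10.0]  # mean 4.0, median 3.0
print(best_theta(data, k=1))  # lands on the median
print(best_theta(data, k=2))  # lands on the mean
```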

  3. The math is much cleaner when we use k=2. It's fairly simple to prove that when k=2, θ is the sample mean. It's messier to show that for k=1, θ is the sample median; I had to do it once, in my PhD coursework. Out of curiosity, I tried using k=4 to figure out what θ should be, but abandoned it once I expanded (xi - θ)^3, because it seemed too messy to pursue for no real reason. This messiness compounds itself when we try to move beyond a simple measure of center and into more complex models.
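For what it's worth, even though the k=4 algebra is messy, the k=4 center can still be found numerically: the objective Σ(xi - θ)^4 is convex in θ, so bisection on its derivative works (a sketch with made-up data):

```python
def d4(xs, theta):
    # derivative of sum((x - theta)**4) with respect to theta
    return -4 * sum((x - theta) ** 3 for x in xs)

def k4_center(xs, lo=-100.0, hi=100.0, iters=200):
    # the objective is convex, so its derivative increases in theta:
    # bisect for the root
    for _ in range(iters):
        mid = (lo + hi) / 2
        if d4(xs, mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

data = [1.0, 2.0, 3.0, 4.0, 10.0]
print(k4_center(data))  # neither the mean (4.0) nor the median (3.0)
```

On this data the k=4 center sits above the mean, which again shows how a higher power lets the outlier pull the center toward itself.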

  4. Related to point 2, we like the sample mean. In particular, there is a very nice theorem stating that as the sample size increases, the sampling distribution of the sample mean converges towards the Normal distribution (you may not have reached probability distributions yet, but this will come). Knowing that a particular quantity will have an approximately Normal distribution under pretty mild conditions is extremely useful. Hence, we "want" to be able to use the sample mean, because it means we can build a lot of theory and a lot of methods on the Normal distribution. Using the sample mean goes hand in hand with squaring deviations (k=2).
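That theorem (the Central Limit Theorem) shows up in a quick simulation. Here we draw from a heavily skewed exponential distribution whose true mean is 1.0 (a sketch; the sample sizes are arbitrary):

```python
import random
import statistics

random.seed(0)

def sample_mean(n):
    # mean of n draws from a skewed (exponential) distribution
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

means = [sample_mean(50) for _ in range(5000)]

# The sampling distribution of the mean clusters tightly and roughly
# symmetrically around the true mean, 1.0, even though each individual
# draw is heavily skewed.
print(statistics.fmean(means), statistics.stdev(means))
```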

Also, the variance is a special case of the "covariance", which describes how two variables vary together (the variance is the case where the two variables are the exact same variable). So using squared deviations is a natural choice that fits well with other concepts. The idea of squaring deviations also arises naturally out of simple mathematics, irrespective of anything else. For instance, if we do linear regression and say that y = Xβ, where y is a vector, X is a matrix, and β is a vector of coefficients, then simple linear algebra (solving the normal equations) yields the same solution as minimizing the squared deviations (seeing this connection isn't quite as clear, but it is true).
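That regression connection can be seen in a small sketch (made-up data): the slope and intercept from the closed-form normal-equations formulas are exactly the pair that minimizes the sum of squared deviations.

```python
import statistics

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]  # made-up, roughly linear

# Slope and intercept from the normal equations (pure linear algebra)
xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        / sum((x - xbar) ** 2 for x in xs))
alpha = ybar - beta * xbar

def sse(a, b):
    # sum of squared deviations of the data from the line y = a + b*x
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Nudging either coefficient in any direction only increases the error
assert sse(alpha, beta) <= min(
    sse(alpha + 0.01, beta), sse(alpha - 0.01, beta),
    sse(alpha, beta + 0.01), sse(alpha, beta - 0.01))
print(alpha, beta)
```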
