Solved – Using the median for calculating Variance

Tags: mean, median, variance

I have a 1-D random variable which is extremely skewed. In order to normalize this distribution, I want to use the median rather than the mean. My question is this: can I calculate the variance of the distribution using the median in the formula instead of the mean?

i.e. can I replace

$$ \mathrm{Var}(X) = \frac{1}{n}\sum_i (X_i - \mathrm{mean}(X))^2 $$

with

$$ \mathrm{Var}(X) = \frac{1}{n}\sum_i (X_i - \mathrm{median}(X))^2 $$

My reasoning is that since variance is a measure of spread with respect to the central tendency of a distribution, this shouldn't be a problem, but I'm looking to validate that logic.
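Concretely, a quick R sketch of what I mean (rexp here is just a placeholder for a right-skewed variable):

set.seed(1)
x <- rexp(1000)            # placeholder for my skewed variable

mean((x - mean(x))^2)      # usual variance: squared deviations from the mean
mean((x - median(x))^2)    # proposed version: squared deviations from the median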

Best Answer

The mean minimizes the squared error (or L2 norm; see here or here), so the natural choice for measuring spread around the mean is the squared distance from it (see here on why we square it). On the other hand, the median minimizes the absolute error (L1 norm), i.e. it is the value that sits in the "middle" of your data, so the absolute distance from the median (as in the so-called median absolute deviation, or MAD) seems to be a better measure of the degree of variability around the median. You can read more about these relations in this thread.
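As a quick illustration in R, the two notions of spread correspond roughly to sd() and mad(); this is only a sketch, and note that mad() multiplies by 1.4826 by default so that it is comparable to the standard deviation for normal data:

set.seed(1)
x <- rexp(1000)        # a right-skewed sample

sd(x)                  # spread around the mean (squared-error / L2 view)
mad(x)                 # median absolute deviation from the median (L1 view),
                       # scaled by 1.4826 by default to be comparable to sd()
mad(x, constant = 1)   # the raw median of |x - median(x)|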

In short, variance and MAD differ in how they define the central point of your data, and this influences how we measure the variation of data points around it. Squaring the deviations gives outliers a greater influence on the central point (the mean), while in the case of the median all points have the same impact on it, so the absolute distance seems more appropriate.

This can also be shown by a simple simulation. If you compare the squared distances of the values from the mean and from the median, the total squared distance is almost always smaller around the mean than around the median. On the other hand, the total absolute distance is smaller around the median than around the mean. The R code for the simulation is posted below.

# TRUE if the total squared distance is smaller around the mean than around the median
sqtest  <- function(x) sum((x - mean(x))^2)  < sum((x - median(x))^2)
# TRUE if the total absolute distance is smaller around the median than around the mean
abstest <- function(x) sum(abs(x - mean(x))) > sum(abs(x - median(x)))

# Proportion of simulated samples for which each statement holds
mean(replicate(1000, sqtest(rnorm(1000))))   # symmetric: normal
mean(replicate(1000, abstest(rnorm(1000))))

mean(replicate(1000, sqtest(rexp(1000))))    # right-skewed: exponential
mean(replicate(1000, abstest(rexp(1000))))

mean(replicate(1000, sqtest(runif(1000))))   # symmetric: uniform
mean(replicate(1000, abstest(runif(1000))))

Using the median instead of the mean in such a "variance" would therefore lead to higher estimates than using the mean, as is traditionally done.
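One way to see this (writing $\bar{X}$ for $\mathrm{mean}(X)$): for any constant $a$,

$$ \frac{1}{n}\sum_i (X_i - a)^2 = \frac{1}{n}\sum_i (X_i - \bar{X})^2 + (\bar{X} - a)^2, $$

so plugging in $a = \mathrm{median}(X)$ gives the usual variance plus the non-negative term $(\bar{X} - \mathrm{median}(X))^2$; the median-based estimate is never smaller, and it exceeds the usual variance whenever the mean and median differ.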

By the way, the relationship between the L1 and L2 norms can also be considered in a Bayesian context, as in this thread.