Solved – Estimate median from mean, std dev, and/or range

descriptive statisticsestimationmedian

I have easy access to the the mean, standard deviation and min/max values for a variable. I also have the number of elements used to compute those values. The data is real numbers with an absolute zero.

Given this information, is it possible to estimate the median value?

Best Answer

It's certainly possible to place some bounds on the median, but without further assumptions they might potentially be pretty weak bounds. The problem is that the only gauge you have on how skew it might be (particularly, in the sense of the second Pearson skewness) is the relative positions of the extrema to the mean, and they're typically a very weak indicator of that. Adding in the fact that the variable is nonnegative gives a second very weak indicator of skewness (the relative size of the standard deviation and mean).

But the second Pearson skewness does give us a bound: for a distribution, the median cannot be more than one standard deviation from the mean. (For a sample, because of the effect of the usual Bessel correction on standard deviation, it must lie somewhat inside those limits.)

If the standard deviation is small, that may be adequate for some purposes.

If we denote the median as $\stackrel{\sim}{x}$, the mean as $\bar{x}$, the usual sample standard deviation as $s_{n-1}$ (and let $s_n=\sqrt{\frac{n-1}{n}}s_{n-1}$ be the uncorrected s.d.), the minimum as $x_{(1)}$ and the maximum as $x_{(n)}$ then naively, we can immediately say that

$$\max(x_{(1)},\bar{x}-s_n)\leq\,\,\stackrel{\sim}{x}\,\,\leq \min(x_{(n)},\bar{x}+s_n)\,.$$

By more careful consideration of all the information, knowing the minimum and maximum might bound the result still further, but my guess is not necessarily by very much (it may help more in some cases than others). Knowing the sample size, $n$, may also add some important information, particularly if $n$ is small.

The fact that the variable is non-negative might help. Markov's inequality suggests that the median cannot be more than twice the mean, perhaps that may sometimes improve the bound from the mean plus a standard deviation (though, if the s.d. were greater than the mean, you'd usually expect the median to be lower than the mean; again it may be possible to get better bounds still).

Anyway, adding that bound to our previous naive bounds, we have:

$$\max(x_{(1)},\bar{x}-s_n)\leq\,\,\stackrel{\sim}{x}\,\,\leq \min(x_{(n)},\bar{x}+s_n,2\bar{x})\,.$$

(In that situation we also know that the median is above $0$, but given we know $x_{(1)}$, that knowledge doesn't ever improve the lower bound.)


Edit: I simulated a few data sets from different distributions (partly to see how the bounds behaved and partly as a double check that I hadn't made any egregious errors). One of the examples did have the property that $2\bar x$ was a bit less than $\bar x +s_n$ (thus reducing the upper bound on the median, so adding that third component does sometimes help), but as I expected might often be the case, the actual median was less than the mean (so it didn't make the upper bound very close).

Still, the intervals did actually enclose the median for every example I did.


If you assumed some distributional form (like, say, normality), then of course you can get much better estimates (/intervals).