Should the mean be used when data are skewed?

central-tendency, mean, median, skewness, winsorizing

Introductory applied statistics texts often distinguish the mean from the median (usually in the context of descriptive statistics, when motivating the mean, median and mode as summaries of central tendency) by explaining that the mean is sensitive to outliers in sample data and/or to skewed population distributions. This is then used to justify the assertion that the median is to be preferred when the data are not symmetrical.

For example:

"The best measure of central tendency for a given set of data often depends on the way in which the values are distributed…. When data are not symmetric, the median is often the best measure of central tendency. Because the mean is sensitive to extreme observations, it is pulled in the direction of the outlying data values, and as a result might end up excessively inflated or excessively deflated."

—Pagano and Gauvreau, (2000) Principles of Biostatistics, 2nd ed. (P&G were at hand, BTW, not singling them out per se.)

The authors define "central tendency" thus: "The most commonly investigated characteristic of a set of data is its center, or the point about which observations tend to cluster."

This strikes me as a less-than-forthright way of saying "only use the median, period," because using the mean only when the data/distributions are symmetrical is the same as using the mean only when it equals the median. Edit: whuber rightly points out that I am conflating robust measures of central tendency with the median. So it is important to keep in mind that I am discussing the specific framing of the arithmetic mean versus the median in introductory applied statistics (where, mode aside, other measures of central tendency are not motivated).

Rather than judging the utility of the mean by how much it departs from the behavior of the median, ought we not simply understand these as two different measures of centrality? In other words, being sensitive to skewness is a feature of the mean. One could just as validly argue "well, the median is no good because it is largely insensitive to skewness, so only use it when it equals the mean."
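As a toy illustration with made-up numbers (the values below are hypothetical, not taken from any text cited here), the two statistics respond very differently to a single extreme observation, which is the sense in which they measure different things:

```python
from statistics import mean, median

# Hypothetical values (say, incomes in thousands); invented for illustration.
values = [28, 31, 35, 38, 42, 47, 55]
with_extreme = values + [900]  # add one extreme observation

print(mean(values), median(values))              # mean and median are close
print(mean(with_extreme), median(with_extreme))  # mean jumps, median barely moves

# The mean tracks the total: mean * n recovers the sum, extreme value included.
print(mean(with_extreme) * len(with_extreme) == sum(with_extreme))
```

Whether that jump counts as a defect or as information depends on what the summary is for.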

(The mode is quite sensibly not getting involved with this question.)

Best Answer

I disagree with the advice as a flat-out rule. (It's not common to all books.)

The issues are more subtle.

If you're actually interested in making inference about the population mean, the sample mean is at least an unbiased estimator of it, and has a number of other advantages. In fact, see the Gauss-Markov theorem: the sample mean is the best linear unbiased estimator.
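A rough simulation sketch of both points, under assumptions of my own choosing: an exponential population (skewed, with mean 2), samples of size 25, and a hypothetical competitor that is also linear and unbiased but uses unequal weights. The function name and all parameter values are illustrative only.

```python
import random
from statistics import fmean, pvariance


def compare_linear_estimators(n=25, reps=20000, scale=2.0, seed=0):
    """Simulate two linear unbiased estimators of an exponential mean.

    All parameter values are illustrative assumptions.
    Returns (avg_equal, var_equal, avg_ramp, var_ramp).
    """
    rng = random.Random(seed)
    # Equal weights 1/n give the ordinary sample mean.
    equal_w = [1.0 / n] * n
    # Linearly increasing weights that still sum to 1, so the estimator
    # stays unbiased but is no longer the sample mean.
    total = n * (n + 1) / 2
    ramp_w = [(i + 1) / total for i in range(n)]

    equal_est, ramp_est = [], []
    for _ in range(reps):
        # expovariate takes the rate, so rate = 1/scale gives mean = scale.
        x = [rng.expovariate(1.0 / scale) for _ in range(n)]
        equal_est.append(sum(w * xi for w, xi in zip(equal_w, x)))
        ramp_est.append(sum(w * xi for w, xi in zip(ramp_w, x)))
    return (fmean(equal_est), pvariance(equal_est),
            fmean(ramp_est), pvariance(ramp_est))


if __name__ == "__main__":
    avg_eq, var_eq, avg_ramp, var_ramp = compare_linear_estimators()
    print(f"sample mean:      average {avg_eq:.3f}, variance {var_eq:.4f}")
    print(f"unequal weights:  average {avg_ramp:.3f}, variance {var_ramp:.4f}")
```

Both estimators should average out near the population mean despite the skewness, while the equally weighted one (the sample mean) should show the smaller variance, as Gauss-Markov says it must among linear unbiased estimators.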

If your variables are heavily skewed, the problem comes with 'linear': in some situations all linear estimators may be bad, so the best of them may still be unattractive. A nonlinear estimator of the mean may then be better, but it would require knowing something (or even quite a lot) about the distribution. We don't always have that luxury.
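To make that concrete, here is a sketch under a strong assumption: that the data are known to be lognormal. In that case one nonlinear estimator of the population mean is the maximum-likelihood one, exp(m + s²/2), where m and s² are the mean and variance of the logged data. The distribution, the parameter values (sigma = 1.5, n = 50) and the function name are my choices for illustration, not part of the argument above.

```python
import math
import random
from statistics import fmean


def compare_mean_estimators(mu=0.0, sigma=1.5, n=50, reps=20000, seed=1):
    """Monte Carlo MSE of two estimators of a lognormal population mean.

    Returns (true_mean, mse_sample_mean, mse_lognormal_mle).
    """
    rng = random.Random(seed)
    true_mean = math.exp(mu + sigma ** 2 / 2)  # lognormal population mean
    sq_err_mean = 0.0
    sq_err_mle = 0.0
    for _ in range(reps):
        # Draw on the log scale; the observed data are exp(log value).
        logs = [rng.gauss(mu, sigma) for _ in range(n)]
        sample_mean = fmean([math.exp(y) for y in logs])  # linear in the data

        # Nonlinear estimator: plug log-scale ML estimates into the mean formula.
        m = fmean(logs)
        s2 = fmean([(y - m) ** 2 for y in logs])  # ML variance (divides by n)
        mle = math.exp(m + s2 / 2)

        sq_err_mean += (sample_mean - true_mean) ** 2
        sq_err_mle += (mle - true_mean) ** 2
    return true_mean, sq_err_mean / reps, sq_err_mle / reps


if __name__ == "__main__":
    true_mean, mse_mean, mse_mle = compare_mean_estimators()
    print(f"population mean:        {true_mean:.3f}")
    print(f"MSE, sample mean:       {mse_mean:.3f}")
    print(f"MSE, lognormal-based:   {mse_mle:.3f}")
```

With settings like these the lognormal-based estimator should show a noticeably smaller mean squared error, even though it is slightly biased. That advantage is bought entirely with the distributional assumption; if the data are not actually lognormal, it may not hold.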

If you're not necessarily interested in inference about a population mean (say, "what's a typical age?", or whether there's a more general location shift from one population to another, which might be phrased in terms of any measure of location, or even a test of whether one variable is stochastically larger than another), then casting the question in terms of the population mean is either unnecessary or, in the last case, likely counterproductive.

So I think it comes down to thinking about:

  • what are your actual questions? Is the population mean even a good thing to be asking about in this situation?

  • what is the best way to answer the question given the situation (skewness in this case)? Is using sample means the best approach to answering our questions of interest?

It may be that you have questions not directly about population means for which sample means are nevertheless a good tool (for example, the population median of a waiting time that you assume to be exponentially distributed is better estimated as a particular fraction, ln 2 ≈ 0.693, of the sample mean than by the sample median) ... or vice versa: the question might be about population means, but sample means might not be the best way to answer it.
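The exponential waiting-time example can be checked with a small simulation. This is only a sketch: the scale of 3, the sample size of 30 and the function name are arbitrary choices of mine, and the whole comparison assumes the data really are exponential, in which case the population median equals ln 2 times the population mean.

```python
import math
import random
from statistics import fmean, median


def compare_median_estimators(scale=3.0, n=30, reps=20000, seed=2):
    """Monte Carlo MSE of two estimators of an exponential population median.

    Returns (true_median, mse_scaled_mean, mse_sample_median).
    """
    rng = random.Random(seed)
    true_median = scale * math.log(2)  # median of an exponential with this mean
    sq_err_scaled_mean = 0.0
    sq_err_sample_median = 0.0
    for _ in range(reps):
        # expovariate takes the rate, so rate = 1/scale gives mean = scale.
        x = [rng.expovariate(1.0 / scale) for _ in range(n)]
        via_mean = math.log(2) * fmean(x)  # fraction of the sample mean
        via_median = median(x)             # plain sample median
        sq_err_scaled_mean += (via_mean - true_median) ** 2
        sq_err_sample_median += (via_median - true_median) ** 2
    return true_median, sq_err_scaled_mean / reps, sq_err_sample_median / reps


if __name__ == "__main__":
    true_median, mse_scaled, mse_plain = compare_median_estimators()
    print(f"population median:         {true_median:.3f}")
    print(f"MSE, ln(2) * sample mean:  {mse_scaled:.4f}")
    print(f"MSE, sample median:        {mse_plain:.4f}")
```

Under the exponential assumption the scaled sample mean should come out with roughly half the mean squared error of the sample median, so a question about a median ends up being answered best with a mean.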