Solved – When does the amount of skew or prevalence of outliers make the median preferable to the mean?

mean, median

I know that "deviations in the data are devil", and that when a distribution is highly skewed it is better to use the median rather than the mean as the average. But how do I decide on these hard limits?

For example:

  • CASE 1:

    • Assume X = 10,20,30,40,50,60,70
    • In this case, I think that it is better to use mean and that it will give very accurate results.
  • CASE 2:

    • Assume X = 10,20,30,40,50,60,70,7000
    • In this case, I think that it is better to use median instead of using the mean.
  • CASE 3:

    • Assume X = 10,20,30,400,500,600,700
    • In this case, I think it is better to use the IQR (interquartile range).
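To make the three cases concrete, here is a minimal sketch that computes the mean, median, and IQR for each sample using Python's standard `statistics` module. The names `cases` and `summarise` are illustrative only, and the IQR uses the module's default ("exclusive") quantile method, which is one of several conventions.

```python
import statistics

# The three samples from the cases above.
cases = {
    "case 1": [10, 20, 30, 40, 50, 60, 70],
    "case 2": [10, 20, 30, 40, 50, 60, 70, 7000],
    "case 3": [10, 20, 30, 400, 500, 600, 700],
}

def summarise(xs):
    """Return the mean, median and interquartile range of a sample."""
    q1, _, q3 = statistics.quantiles(xs, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "iqr": q3 - q1,
    }

for name, xs in cases.items():
    print(name, summarise(xs))

# Case 1: mean 40 and median 40 agree.
# Case 2: mean 910 versus median 45; the single value 7000 drags the mean.
# Case 3: median 400, mean about 322.9.
```

The gap between mean and median in case 2 is what signals that one observation dominates the mean; the numbers alone do not say which summary is "right" for a given purpose.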

But I'm stuck on how to decide these hard limits, i.e. which measure to use under which conditions, in general.

I've found a tool that works on this principle: it takes a context-free sample distribution as input and determines whether the mean is close to, moderate, or against the null hypothesis.


What I'm really looking for is a good answer that explains how to derive these conclusions.

Best Answer

Framing the question

  • You are asking an applied and subjective question, and thus, any answer needs to be infused with applied and subjective considerations.

  • From a purely statistical perspective, the mean and median provide different information about the central tendency of a sample of data. Thus, neither is correct or incorrect by definition.

  • From an applied perspective, we often want to say something meaningful about the central tendency of a sample, where central tendency maps onto some subjective notion of "typical".

General thoughts

  • When summarising what is typical in a sample, observations that are many standard deviations away from the mean (perhaps 3 or 4 SD) will have a large influence on the mean, but little influence on the median. Such observations may lead the mean to deviate from what we think of as the "typical" value of the sample. This helps to explain the popularity of the median for reporting house prices and income: such distributions are typically positively skewed and often include extreme outliers, so a single island in the Pacific or a single billionaire could dramatically shift the mean, whereas the median is robust to such values.

  • The median can be problematic when the data take on a limited number of values. For example, the median of a 5-point Likert item lacks the nuance possessed by the mean: samples with means of 2.8, 3.0, and 3.3 might all have a median of 3.

  • In general, the mean has the benefit of using more of the information from the data.

  • When a distribution is skewed, it is also possible to transform the data and report the mean of the transformed distribution.

  • When a distribution includes outliers, it is possible to use a trimmed mean, remove the outliers, or adjust the value of each outlier to a less extreme value (e.g., 2 SD from the mean).
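The last two options can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a recommended procedure: the function names, the 20% trimming proportion, the 2 SD clipping limit, and the choice of a log transformation are all assumptions made for the example.

```python
import math
import statistics

def trimmed_mean(xs, proportion=0.2):
    """Drop the lowest and highest `proportion` of observations, then average."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    xs = sorted(xs)
    k = int(len(xs) * proportion)  # number of values dropped from each tail
    return statistics.mean(xs[k:len(xs) - k])

def clipped_mean(xs, n_sd=2.0):
    """Pull values beyond n_sd sample SDs of the mean back to that limit, then average."""
    m, sd = statistics.mean(xs), statistics.stdev(xs)
    lo, hi = m - n_sd * sd, m + n_sd * sd
    return statistics.mean([min(max(x, lo), hi) for x in xs])

def back_transformed_log_mean(xs):
    """Mean on the log scale, transformed back; requires all values > 0."""
    return math.exp(statistics.mean([math.log(x) for x in xs]))

x = [10, 20, 30, 40, 50, 60, 70, 7000]  # case 2 from the question
print("mean:           ", statistics.mean(x))    # 910
print("median:         ", statistics.median(x))  # 45.0
print("20% trimmed:    ", trimmed_mean(x))       # 45 (drops 10 and 7000)
print("clipped at 2 SD:", clipped_mean(x))
print("log-scale mean: ", back_transformed_log_mean(x))
```

One caveat the sketch makes visible: the SD is itself inflated by the outlier, so in this sample the 2 SD rule only moves 7000 down to about 5832, and the adjusted mean stays far above the median. The trimmed mean and the log-scale mean do not have that dependence on the SD.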
