Mathematical Statistics – Why is the Median Less Sensitive to Extreme Values Compared to the Mean?

mathematical-statistics, mean, median, outliers, robust

I am sure we have all heard the following argument stated in some way or the other:

  • For a given set of measurements (e.g. heights of students), the mean of these measurements is more "prone" to being influenced by outliers than the median of these same measurements.

Conceptually, the above argument is straightforward to understand. The median is not directly calculated using the "value" of any of the measurements, but only using the "ranked position" of the measurements. On the other hand, the mean is directly calculated using the "values" of the measurements, and not by using the "ranked position" of the measurements. Therefore, a larger number of outlier points (or more extreme ones) should be required to influence the median of these measurements than to influence the mean. For example: the average weight of a blue whale and 100 squirrels will be closer to the blue whale's weight, but the median weight of a blue whale and 100 squirrels will be closer to the squirrels.
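As a quick sketch of the whale-and-squirrels example in R (the weights below are purely illustrative assumptions, not real measurements):

# illustrative weights: ~150,000 kg for a blue whale, ~0.5 kg per squirrel
weights = c(150000, runif(100, 0.3, 0.7))
mean(weights)    # on the order of 1,500 kg, dominated by the whale
median(weights)  # around 0.5 kg, stays with the squirrels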

Using the R programming language, we can see this argument manifest itself on simulated data:

library(plotly)
set.seed(123)

# 20 "non outlier" points and 5 "outlier" points
d  = data.frame(data = rnorm(20, 5, 50),  col = "non outlier")
dd = data.frame(data = rnorm(5, 150, 10), col = "outlier")

my_data = rbind(d, dd)

# mean and median of the non-outlier data d
> mean(d$data)
[1] 10.08877

> median(d$data)
[1] 17.11447
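To see the contrast directly, rather than only for the non-outlier subset d, one can also summarise the combined sample my_data (the exact values depend on the seed, so no output is shown here):

mean(my_data$data)    # pulled noticeably towards the outlier cluster around 150
median(my_data$data)  # moves far less, since it depends only on ordered positions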

We can also plot this to get a better idea:

d1 = data.frame(data = mean(my_data$data),       col = "mean")
# add "1" to the median so that it becomes visible in the plot
d2 = data.frame(data = median(my_data$data) + 1, col = "median")

new_data = rbind(my_data, d1, d2)

plot_ly(type = "scatter", mode = "markers", data = new_data,
        x = ~data, y = " ", color = ~col) %>%
  layout(title = 'Effect of Outliers on Median vs Mean')

[Plot: Effect of Outliers on Median vs Mean]

My Question: In the above example, we can see that the median is less influenced by the outliers compared to the mean – but in general, are there any "statistical proofs" that shed light on this inherent "vulnerability" of the mean compared to the median?

Apart from the logical argument of measurement "values" vs. "ranked positions" of measurements – are there any theoretical arguments for why the median requires larger-valued outliers, and a larger number of them, to be pulled towards the extremes of the data than the mean does?

I am aware of related concepts such as Cook's Distance (https://en.wikipedia.org/wiki/Cook%27s_distance), which can be used to estimate the effect of removing an individual data point on a regression model – but are there any formulas which show some relation between the number/values of outliers on the mean vs. the median?

Are there any theoretical statistical arguments that can be made to justify this logical argument regarding the number/values of outliers on the mean vs. the median?

Best Answer

If you write the sample mean $\bar x$ as a function of an outlier $O$, then its sensitivity to the value of the outlier is $d\bar x(O)/dO=1/n$, where $n$ is the sample size. The same quantity for the median is zero, because changing the value of an outlier usually does nothing to the median.

An example to demonstrate the idea: 1, 4, 100. The sample mean is $\bar x=35$; if you replace 100 with 1000, you get $\bar x=335$. The median stays the same: 4.
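A quick check of this example in R:

x = c(1, 4, 100)
mean(x)                # 35
median(x)              # 4
mean(c(1, 4, 1000))    # 335
median(c(1, 4, 1000))  # still 4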

This assumes that the outlier $O$ is not right in the middle of your sample; otherwise, you may get a bigger impact from an outlier on the median than on the mean.

TL;DR;

adding the outlier

You may be tempted to measure the impact of an outlier by adding it to the sample instead of replacing a valid observation with an outlier. This can be done, but you have to isolate the impact of the change in sample size. If you don't do it correctly, you may end up with pseudo counterfactual examples, some of which were proposed in answers here. I'll show you how to do it correctly, then incorrectly.

The mean $\bar x_n$ changes as follows when you add an outlier $O$ to the sample of size $n$: $$\bar x_{n+O}-\bar x_n=\frac {n \bar x_n +O}{n+1}-\bar x_n$$ Now, let's separate the effect of adding a new observation $x_{n+1}$ from the effect of changing its value from $x_{n+1}$ to $O$. We have to do this because, by definition, an outlier is an observation that is not from the same distribution as the rest of the sample $x_i$. Remember, an outlier is not merely a large observation, although that is how we often detect them. It is an observation that doesn't belong to the sample, and must be removed from it for this reason. Here's how we isolate the two steps: $$\bar x_{n+O}-\bar x_n=\frac {n \bar x_n +x_{n+1}}{n+1}-\bar x_n+\frac {O-x_{n+1}}{n+1}\\ =(\bar x_{n+1}-\bar x_n)+\frac {O-x_{n+1}}{n+1}$$

Now we can see that the second term $\frac {O-x_{n+1}}{n+1}$ in the equation represents the outlier's impact on the mean, and that the sensitivity to turning a legit observation $x_{n+1}$ into an outlier $O$ is of order $1/(n+1)$, just as in the case where we were not adding an observation to the sample. Note that the first term, $\bar x_{n+1}-\bar x_n$, which represents an additional observation from the same population, is zero on average.
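Here is a minimal numerical sketch of this decomposition in R; the sample, the legitimate value x_{n+1} and the outlier O below are arbitrary choices made for illustration, not values from the answer:

set.seed(1)
x     = rnorm(100)                    # original sample of size n
n     = length(x)
x_new = 0.5                           # a legitimate new observation x_{n+1}
O     = 50                            # the outlier that replaces it
step1 = mean(c(x, x_new)) - mean(x)   # effect of adding a same-distribution point
step2 = (O - x_new) / (n + 1)         # effect of turning x_{n+1} into the outlier O
all.equal(step1 + step2, mean(c(x, O)) - mean(x))  # TRUE: the two steps add up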

If we apply the same approach to the median $\bar{\bar x}_n$, we get the following equation: $$\bar{\bar x}_{n+O}-\bar{\bar x}_n=(\bar{\bar x}_{n+1}-\bar{\bar x}_n)+0\times(O-x_{n+1})\\=(\bar{\bar x}_{n+1}-\bar{\bar x}_n)$$ In other words, there is no impact from replacing the legit observation $x_{n+1}$ with an outlier $O$; the only reason the median $\bar{\bar x}_n$ changes is the sampling of a new observation from the same distribution.
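The same kind of check for the median, again with arbitrary illustrative values, where both x_{n+1} and O sit above the sample median (i.e. the outlier is not "in the middle"):

set.seed(1)
x     = rnorm(100)
x_new = 3    # a legitimate new observation in the upper tail
O     = 50   # the outlier that replaces it; both lie above the sample median
median(c(x, x_new)) == median(c(x, O))  # TRUE: the replacement leaves the median unchanged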

a counterfactual that isn't

The analysis in the previous section shows how to construct a pseudo counterfactual example: use a large $n\gg 1$ so that the second term in the mean expression, $\frac {O-x_{n+1}}{n+1}$, is smaller than the total change in the median. Here's one such example: "... our data is 5000 ones and 5000 hundreds, and we add an outlier of -100..."

Let's break this example into components as explained above. As the example implies, the values in the distribution are 1s and 100s, and -100 is an outlier. So we can plug in $x_{10001}=1$ and look at the mean: $$\bar x_{10000+O}-\bar x_{10000} =\left(\frac{505001}{10001}-50.5\right)+\frac {-100-1}{10001}\\\approx -0.00495-0.01010\approx -0.01505$$ The term $-0.01010$ in the expression above is the impact of the outlier value. It is small, as designed, but it is nonzero.

The same for the median: $$\bar{\bar x}_{10000+O}-\bar{\bar x}_{10000}=(\bar{\bar x}_{10001}-\bar{\bar x}_{10000})\\= (1-50.5)=-49.5$$
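These numbers are easy to verify directly in R:

x = c(rep(1, 5000), rep(100, 5000))
mean(c(x, -100)) - mean(x)      # about -0.0150: a tiny change in the mean
median(c(x, -100)) - median(x)  # -49.5: a big change, driven by the extra observation alone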

Voila! We manufactured a giant change in the median while the mean barely moved. However, if you followed my analysis, you can see the trick: the entire change in the median comes from adding a new observation from the same distribution, not from replacing a valid observation with an outlier; that replacement effect is, as expected, zero.

a counterfactual that is

Now, what would be a real counterfactual? In all the previous analysis I assumed that the outlier $O$ stands out from the valid observations, with a magnitude outside the usual range. Those are the outliers that we typically detect. What if its value were right in the middle?

Let's modify the example above:"... our data is 5000 ones and 5000 hundreds, and we add an outlier of ..." 20!

Let's break this example into components as explained above. As the example implies, the values in the distribution are 1s and 100s, and 20 is an outlier. So we can plug in $x_{10001}=1$ and look at the mean: $$\bar x_{10000+O}-\bar x_{10000} =\left(\frac{505001}{10001}-50.5\right)+\frac {20-1}{10001}\\\approx -0.00495+0.00190\approx -0.00305$$ The term $0.00190$ in the expression above is the impact of the outlier value. It is small, as designed, but it is nonzero.

The breakdown for the median is different now! $$\bar{\bar x}_{10000+O}-\bar{\bar x}_{10000}=(\bar{\bar x}_{10001}-\bar{\bar x}_{10000})+(O-x_{10001})\\= (1-50.5)+(20-1)=-49.5+19=-30.5$$
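Verifying this variant in R as well:

x = c(rep(1, 5000), rep(100, 5000))
mean(c(x, 20)) - mean(x)      # about -0.003, of which only +0.0019 is the outlier term
median(c(x, 20)) - median(x)  # -30.5, of which +19 comes from the outlier value itself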

In this example we have a nonzero, and rather huge, change in the median due to the outlier: 19, compared to the corresponding impact on the mean of only 0.0019! This shows that if you have an outlier that lands in the middle of your sample, you can get a bigger impact on the median than on the mean.

conclusion

Note that there are myths and misconceptions in statistics that have strong staying power. For instance, the notion that you need a sample of size 30 for the CLT to kick in. Virtually nobody knows who came up with this rule of thumb, or based on what kind of analysis. So it is fun to entertain the idea that maybe this median/mean thing is one of those cases. However, it is not: the median is indeed usually more robust than the mean to the presence of outliers.