Solved – Median + MAD for skewed data

mad, median, outliers, skewness

I am trying to figure out what happens if you apply Hampel's outlier detection technique, which is based on the median and the MAD, to data that is skewed. Apparently, the advantage of Hampel's method over z-scores is that it is much less influenced by the outliers themselves. However, several papers and websites say that this method should not be applied when the data distribution is skewed, i.e. when the data is not normally distributed. I did not find any literature about what actually happens if you apply this method to skewed data. Does it fail to detect any outliers at all? Or does it produce false positives? I found several questions in this forum about whether to use z-scores or Hampel's approach, some of them even about skewed data, but no one gave an answer about what the outcome of Hampel's method is when it is applied to skewed data.

The closest comment I found in this forum is the following:

"Using the MAD amounts to assuming that the underlying distribution is symmetric (deviations above the median and below the median are considered equally). If your data is skewed this is clearly wrong: it will lead you to overestimating the true variability of your data." (from "Mean$\pm$SD or Median$\pm$MAD to summarise a highly skewed variable?")

It says "it will lead you to overestimating the true variability of your data", but what does that actually mean? Does it lead to the identification of too many or too few outliers?

In addition, can anyone see a problem with applying this technique, compared to z-scores, in studies with small sample sizes?

Can anyone help to shed light on that?

Best Answer

If the uncontaminated data in your sample is drawn from an asymmetric distribution, and the measure of scale you use to determine the width of the rejection region assumes that the good part of your data is symmetric, then these rejection regions will be larger than they need to be. For illustration, suppose the distribution of the data is really right skewed. This would lead you to:

  • Reject genuine observations from the right tail as outliers.
  • Fail to detect outliers from the left tail for what they are.

Overall, the combined effect would be that your (inappropriately) cleaned dataset will look more symmetric than it really is.
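As a rough illustration of the two bullet points above, the following sketch applies the median ± 3 × scaled MAD rule to a simulated right-skewed sample that contains no contamination at all. The function name `hampel_fences`, the cutoff `k=3`, the log-normal sample, and the consistency factor 1.4826 (which itself presumes normal data) are all assumptions made for this example and do not come from the question.

```python
import random
import statistics


def hampel_fences(values, k=3.0):
    """Return (lower, upper) cutoffs of the median +/- k * scaled MAD rule."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    # 1.4826 makes the MAD comparable to a standard deviation
    # under normality (an assumption that skewed data violates).
    scale = 1.4826 * mad
    return med - k * scale, med + k * scale


# Simulated right-skewed data with no outliers added (hypothetical example).
random.seed(0)
sample = [random.lognormvariate(0.0, 1.0) for _ in range(2000)]

lower, upper = hampel_fences(sample)
n_low = sum(v < lower for v in sample)   # flagged in the left tail
n_high = sum(v > upper for v in sample)  # flagged in the right tail

print(f"fences: ({lower:.2f}, {upper:.2f})")
print(f"flagged below: {n_low}, flagged above: {n_high}")
```

In this setup every flagged point sits in the right tail even though all observations are genuine, while the lower fence falls so far below the data that a small left-tail outlier would pass unnoticed.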

The alternative here is to use an outlier detection rule that treats the left and right tails of your sample separately. Of course, compared to the MAD and median, this will also halve the breakdown point of your procedure (this is inevitable because the contamination rate of a half sample can potentially be twice as high as the contamination rate of the full sample).
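One simple way to treat the two tails separately is a "double MAD" rule, where a separate MAD is computed from the points below and above the median. This is not necessarily the rule the answer has in mind; it is only a minimal sketch of the idea, and the name `double_mad_fences`, the cutoff `k=3`, and the factor 1.4826 are assumptions for illustration.

```python
import random
import statistics


def double_mad_fences(values, k=3.0):
    """Return (lower, median, upper) using a separate MAD for each tail."""
    med = statistics.median(values)
    # Deviations on each side of the median are handled on their own.
    left_dev = [med - v for v in values if v <= med]
    right_dev = [v - med for v in values if v >= med]
    # 1.4826 is the usual normal-consistency factor (an assumption here).
    left_scale = 1.4826 * statistics.median(left_dev)
    right_scale = 1.4826 * statistics.median(right_dev)
    return med - k * left_scale, med, med + k * right_scale


# Simulated right-skewed data (hypothetical example).
random.seed(0)
sample = [random.lognormvariate(0.0, 1.0) for _ in range(2000)]

lower, med, upper = double_mad_fences(sample)
print(f"distance to lower fence: {med - lower:.2f}")
print(f"distance to upper fence: {upper - med:.2f}")
```

For right-skewed data the upper fence ends up farther from the median than the lower one, so fewer genuine right-tail points are rejected. Each scale estimate now rests on only half of the sample, which is where the loss in breakdown point mentioned above comes from.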

In my opinion, the best procedure for this problem is to use the rejection regions from the adjusted boxplots. In my experience (drawn from numerical simulation), they can be expected to reliably detect asymmetric contamination even when the data contains as much as 10-15% outliers concentrated in one tail. Adjusted boxplots are widely implemented, and their connection with the classical boxplots makes them easy to understand and use. This answer explains and illustrates the use of adjusted boxplots in a context quite like yours.
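For readers who want to see what the adjusted boxplot fences look like, here is a minimal sketch. It assumes the commonly cited form in which the usual 1.5 × IQR whiskers are rescaled by exponential factors of the medcouple (a robust skewness measure). The medcouple below is a naive O(n²) version that simply skips pairs of identical values rather than applying the special tie-handling kernel, so it is meant for small samples without ties at the median; the function names are hypothetical, and an established implementation should be preferred for real work.

```python
import math
import random
import statistics


def medcouple_naive(values):
    """Naive O(n^2) medcouple; skips tied pairs (a simplification)."""
    med = statistics.median(values)
    lower = [v for v in values if v <= med]
    upper = [v for v in values if v >= med]
    kernel = [
        ((xj - med) - (med - xi)) / (xj - xi)
        for xi in lower
        for xj in upper
        if xj != xi
    ]
    return statistics.median(kernel)


def adjusted_boxplot_fences(values):
    """Return (lower, upper) fences rescaled by the medcouple."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    mc = medcouple_naive(values)
    if mc >= 0:  # right skew: shrink the lower whisker, stretch the upper
        return (q1 - 1.5 * math.exp(-4 * mc) * iqr,
                q3 + 1.5 * math.exp(3 * mc) * iqr)
    # left skew: stretch the lower whisker, shrink the upper
    return (q1 - 1.5 * math.exp(-3 * mc) * iqr,
            q3 + 1.5 * math.exp(4 * mc) * iqr)


# Simulated right-skewed data (hypothetical example).
random.seed(0)
sample = [random.lognormvariate(0.0, 1.0) for _ in range(400)]

mc = medcouple_naive(sample)
lower, upper = adjusted_boxplot_fences(sample)
print(f"medcouple: {mc:.2f}, fences: ({lower:.2f}, {upper:.2f})")
```

When the medcouple is zero the fences reduce to the classical boxplot fences, which is the connection with the classical boxplot mentioned above.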
