Outlier detection for heavy-tailed data

Applying the modified z-score for outlier elimination (Iglewicz and Hoaglin, 1993) to some data, I discovered that a large proportion of the data (~10%) had abs(z) >= 3.5. Further investigation showed that the data are heavy-tailed. I assumed that the Bienaymé–Chebyshev inequality would hold for the median absolute deviation (MAD) too, but it obviously does not.

# summarise a logical vector: counts of TRUE/FALSE, percentage of TRUE, and length
.sum.bool <- function(x) c('TRUE' = sum(x), 'FALSE' = sum(!x),
                           'TRUE %' = round(sum(x) / length(x) * 100, 1), length = length(x))

rrn <- rnorm(10000)
rrt <- rt(10000,1)

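# simplified z-score for demonstration purposes: centered on the mean with a
# cutoff of 3 (the modified z-score proper centers on the median and uses 3.5);
# note that mad() already applies the 1.4826 consistency constant by default
mad.outlier <- function(x) abs(x - mean(x)) / mad(x) > 3
sd.outlier  <- function(x) abs(x - mean(x)) / sd(x) > 3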

rbind(mad.n=.sum.bool(mad.outlier(rrn)),
      sd.n=.sum.bool(sd.outlier(rrn)),
      mad.t=.sum.bool(mad.outlier(rrt)),
      sd.t=.sum.bool(sd.outlier(rrt)))

On the heavy-tailed t-distribution with 1 df, about 14% of the data lie more than 3 MADs from the center, compared with 0.3% for the normal sample.

      TRUE FALSE TRUE % length
mad.n   29  9971    0.3  10000
sd.n    29  9971    0.3  10000
mad.t 1381  8619   13.8  10000
sd.t    33  9967    0.3  10000

Can someone shed light on this property of the MAD/z-score in the presence of heavy-tailed distributions? What are recommendations for outlier detection for heavy-tailed data?

Best Answer

Ignoring the tails, the Gaussian and the Cauchy (t-distribution with 1 df) look pretty similar in their meaty center. The MAD only looks at that meaty center (more or less), so the MAD estimates for the two will be pretty similar, which in turn gives a pretty similar range of "acceptable" values. The Cauchy, with its fat tails, will violate that acceptable range far more often.
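To put rough numbers on this, here is a small sketch that skips simulation and uses the exact quantile and distribution functions. It assumes the same setup as the question (standard normal versus t with 1 df, a cutoff of 3, and the 1.4826 scaling that R's mad() applies by default); the variable names are mine.

```r
# Population MAD of each distribution, scaled by 1.4826 as in R's mad() default.
# Both distributions are symmetric about 0, so the unscaled MAD is the 0.75 quantile.
mad.norm   <- 1.4826 * qnorm(0.75)        # close to 1
mad.cauchy <- 1.4826 * qt(0.75, df = 1)   # the 0.75 quantile of the Cauchy is 1

# Exact probability of landing more than k scaled MADs from the center
k <- 3
tail.norm   <- 2 * pnorm(-k * mad.norm)
tail.cauchy <- 2 * pt(-k * mad.cauchy, df = 1)

round(c(mad.norm = mad.norm, mad.cauchy = mad.cauchy,
        tail.norm = tail.norm, tail.cauchy = tail.cauchy), 4)
```

The two scaled MADs are of the same order (roughly 1 versus roughly 1.5), so the cutoff is only modestly wider for the Cauchy, yet the mass beyond it is about 14% instead of about 0.3%, which is in line with the simulated table in the question. This also shows why there is no Chebyshev-type guarantee here: the MAD only pins down where the central half of the data sits and says nothing about how far out the other half can be.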

I'm not sure what your intentions are with this experiment, though. In my experience, most real-world data lie somewhere in the middle of the Gaussian-to-Cauchy spectrum. When I apply robust statistics, I really just want to cap the influence of any single point. There isn't really a "correct" answer. Robust statistics are focused on reasonable estimates that match the bulk of your data (but not all of it). In the real world, all definitions of "outlier" are subjective. You need to tune acceptance levels to fit your own needs.
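As one possible reading of "capping the influence of any single point", here is a minimal sketch that clips (winsorizes) values at the median plus or minus k scaled MADs instead of deleting them. The helper cap.mad and the default k = 3 are hypothetical choices for illustration, not a recommendation; k is exactly the acceptance level you would tune.

```r
# Hypothetical helper: clip values to median +/- k scaled MADs (winsorizing).
cap.mad <- function(x, k = 3) {
  center <- median(x)
  width  <- k * mad(x)            # mad() already applies the 1.4826 scaling
  pmin(pmax(x, center - width), center + width)
}

set.seed(1)                       # arbitrary seed, only for repeatability
x <- rt(10000, df = 1)            # heavy-tailed sample, as in the question
x.capped <- cap.mad(x, k = 3)

range(x)          # a handful of extreme draws dominate the raw sample
range(x.capped)   # bounded by the chosen cutoff
mean(x.capped)    # no single point can pull this arbitrarily far
```

Every observation is kept, so the roughly 14% of Cauchy draws beyond the cutoff still count toward the estimate, but each one contributes at most k MADs. Raising or lowering k trades robustness against how much of the tail you let through.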
