Solved – How accurate is IQR for detecting outliers

Tags: mean, outliers, reliability

I'm writing a script that analyses run times of processes. I'm not sure of their distribution, but I want to know when a process runs "too long". So far I've been flagging runs more than 3 standard deviations from the mean of the last run times (n > 30), but I was told this does not provide anything useful if the data is not normal (and mine does not appear to be). I found another outlier test that states:

Find the interquartile range, IQR = Q3 – Q1, where Q3 is the third quartile and Q1 is the first quartile. Then compute two fences:

a) Q1 – 1.5*IQR
b) Q3 + 1.5*IQR

A point is an outlier if it is < a or > b.

My data tends to be things like 2 sec, 3 sec, 2 sec, 5 sec, 300 sec, 4 sec, … where 300 sec is obviously an outlier.
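For concreteness, both rules can be sketched in a few lines (a minimal sketch using numpy; the exact fence values depend slightly on the percentile interpolation method):

```python
import numpy as np

def sd_outliers(data, k=3.0):
    """Flag points more than k sample standard deviations from the mean."""
    data = np.asarray(data, dtype=float)
    mu, sigma = data.mean(), data.std(ddof=1)
    return data[np.abs(data - mu) > k * sigma].tolist()

def iqr_outliers(data, k=1.5):
    """Flag points outside the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    return data[(data < q1 - k * iqr) | (data > q3 + k * iqr)].tolist()

times = [2, 3, 2, 5, 300, 4]
print(sd_outliers(times))   # [] -- the 300 inflates the SD, so the SD rule misses it
print(iqr_outliers(times))  # [300.0] -- the IQR rule flags it
```

On this sample the 3-SD rule flags nothing, precisely because the extreme value drags the mean and SD up; the IQR fences are unaffected by the size of the extreme value and flag 300.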

Which method is better? The IQR method or the std deviation method?

Best Answer

There really are entire books on outliers.

The usual specific answer is that the standard deviation is itself pulled up by outliers, so any rule based on the SD may perform poorly.

The Tukey rule you quote (quartiles ± 1.5 IQR) came out of hand calculation with small and moderate-sized datasets in the 1970s, and was designed to flag values you might want to think about individually. It is not clear that it carries over to much larger datasets, or that it applies when you expect considerable skewness.

A more general answer is that an outlier rule is good if it always makes the right decisions, but how can you tell?

This is contentious territory, but I'd expect an outlier to stick out on a graph as being very different from others. But it is often (usually?) a tough call to tell the difference between what you expect in a heavy-tailed distribution and what is too wild to regard as anything but an outlier. Sometimes transformation makes an outlier look much more ordinary.

Furthermore, if you use robust methods you might worry a bit less about precisely which values merit being called outliers, and more about handling outliers in general.
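The answer does not name a specific robust method; one common choice (my illustration, not the author's) is the MAD-based modified z-score of Iglewicz and Hoaglin, which replaces mean/SD with median/MAD:

```python
import statistics

def mad_outliers(data, threshold=3.5):
    """Flag points whose modified z-score, 0.6745 * |x - median| / MAD,
    exceeds the threshold (3.5 is a commonly suggested cutoff)."""
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data)
    if mad == 0:  # degenerate case, same pitfall as IQR = 0 below
        return []
    return [x for x in data if 0.6745 * abs(x - med) / mad > threshold]

print(mad_outliers([2, 3, 2, 5, 300, 4]))  # [300]
```

Because both the median and the MAD ignore the magnitude of extreme values, a single wild run time cannot mask itself the way it can with a mean/SD rule.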

EDIT 20 July 2022 A detail that can easily bite is that the IQR can be 0 without the data being in any sense pathological or bizarre. A simple example is any (0, 1) variable with at least 75% values equal to 0 or at least 75% values equal to 1. Then all values not the majority value could be declared outliers, which doesn't match what a statistically experienced person would usually want. Similar awkwardness can easily arise with ordered categorical or counted variables with a small number of values in practice.
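A quick sketch of that failure mode: with 80% zeros, both quartiles are 0, so the fences collapse onto 0 and every 1 is declared an outlier (numpy, default percentile interpolation):

```python
import numpy as np

data = np.array([0] * 8 + [1] * 2)        # binary variable, 80% zeros
q1, q3 = np.percentile(data, [25, 75])    # both quartiles are 0
iqr = q3 - q1                             # IQR = 0, fences collapse to [0, 0]
flagged = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(iqr, flagged.tolist())  # 0.0 [1, 1] -- every minority value is "outlying"
```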
