Solved – Iglewicz and Hoaglin outlier test with modified z-scores – What to do if the MAD becomes 0

outliers, robust

I'm a programmer with a small statistics background and I need to find outliers in a small list of integers and floats.

After some searching on Google I found the Iglewicz and Hoaglin outlier test, which computes a modified z-score M_i for every value in the list and checks its absolute value against a threshold (normally 3.5).

$$M_{i} = \frac{0.6745(x_{i} - \tilde{x})} {\text{MAD}}$$
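For reference, the formula can be sketched in a few lines of Python. This is not the script from the question (which is not shown); the function name `modified_z_scores` and the sample data are made up for illustration, and the sketch assumes the MAD is nonzero.

```python
from statistics import median

def modified_z_scores(values):
    """Return the modified z-score M_i for each value (assumes MAD != 0)."""
    med = median(values)                         # the median, x-tilde in the formula
    mad = median(abs(x - med) for x in values)   # median absolute deviation
    return [0.6745 * (x - med) / mad for x in values]

data = [1, 2, 3, 4, 5, 6, 7, 100]                # made-up sample with one extreme value
scores = modified_z_scores(data)

# Flag values whose absolute modified z-score exceeds the usual threshold of 3.5
outliers = [x for x, m in zip(data, scores) if abs(m) > 3.5]
print(outliers)
```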

I wrote a little Python script to test it. At first it worked great, but after a few tests I spotted an error.

If you try to find outliers (with my script) in a list with many identical values and one outlier, e.g. data = [10, 10, 10, 10, 10, 10, 10, 100], the MAD (median absolute deviation) becomes 0, and this leads me to my question: "What should I do if the MAD becomes 0?".
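The zero MAD is easy to reproduce. Whenever more than half of the values are identical, the median equals that value and more than half of the absolute deviations are 0, so their median is 0 as well. A minimal sketch using only the standard library (variable names are illustrative):

```python
from statistics import median

data = [10, 10, 10, 10, 10, 10, 10, 100]

med = median(data)                       # median of the data
abs_dev = [abs(x - med) for x in data]   # seven zeros and one deviation of 90
mad = median(abs_dev)                    # median of the absolute deviations

print(med, abs_dev, mad)
```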

My first idea was to set the MAD to ∞, but this causes the script to find no outliers.

My second idea was to add very small offsets to the values to make them unique e.g. data = [10.0, 10.00000001, 10.00000002, 10.00000003, 10.00000004, 10.00000004, 10.00000005, 100]. This way the MAD can't become 0 and my script is able to detect the outlier 100.

Does somebody have better ideas?

Am I doing something wrong?

Best Answer

1. A practical suggestion.

Change this part of the code

    if mad == 0:
        mad = 9223372036854775807 # maxint

to

    if mad == 0:
        mad = 2.2250738585072014e-308 # sys.float_info.min

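This does the trick. Dividing by such a tiny number blows up the Iglewicz-Hoaglin test statistic for every value that differs from the median, while values equal to the median still get a score of 0. That is exactly what is desired: strongly deviant observations are marked as outliers.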
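A minimal sketch of this suggestion applied to the data from the question. The function name `modified_z_scores` is an assumption, since the original script is not shown; only the `mad == 0` fallback is taken from the answer.

```python
import sys
from statistics import median

def modified_z_scores(values):
    """Modified z-scores with a tiny fallback divisor when the MAD is 0."""
    med = median(values)
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        # Smallest positive normalized float; avoids dividing by zero
        mad = sys.float_info.min
    return [0.6745 * (x - med) / mad for x in values]

data = [10, 10, 10, 10, 10, 10, 10, 100]
scores = modified_z_scores(data)

# Values equal to the median score 0; the deviant value gets a huge score
outliers = [x for x, m in zip(data, scores) if abs(m) > 3.5]
print(outliers)
```

One consequence of this design is worth keeping in mind: when the MAD is 0, any value that differs from the median at all, even by a tiny amount, gets an enormous score and is flagged.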

2. Previous practical suggestion.

What you could do is check whether it works with the closely related mean absolute error (MAE), taken around the median:

$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^n |x_i - \text{median}(x)|, $$

where $e_i = x_i - \text{median}(x)$ are the errors (better: residuals or deviations) whose absolute values are averaged.

IBM uses this variant for the case MAD = 0:

$$ M_{i} = \frac{x_{i} - \text{median}(x)} { 1.253314 \cdot \text{MAE} } $$
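A sketch of how this fallback might look in Python. The function name and the handling of the case where all values are identical (MAE also 0) are assumptions made here, not part of the formulas above.

```python
from statistics import mean, median

def modified_z_scores(values):
    """Modified z-scores, switching to the MAE-based variant when the MAD is 0."""
    med = median(values)
    abs_dev = [abs(x - med) for x in values]
    mad = median(abs_dev)
    if mad != 0:
        return [0.6745 * (x - med) / mad for x in values]
    mae = mean(abs_dev)
    if mae == 0:
        # Assumption: if every value is identical, nothing is an outlier
        return [0.0 for _ in values]
    return [(x - med) / (1.253314 * mae) for x in values]

data = [10, 10, 10, 10, 10, 10, 10, 100]
scores = modified_z_scores(data)
outliers = [x for x, m in zip(data, scores) if abs(m) > 3.5]
print(scores[-1], outliers)
```

For this data the MAE is 90 / 8 = 11.25, so the score of 100 is 90 / (1.253314 · 11.25) ≈ 6.38, which exceeds 3.5.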

3. What is going on here? (From a programming perspective)

Consider the two cases:

$0/0$,
$x/0$ for $x \neq 0$.

The scientific programming languages R, MATLAB and Julia behave as follows:

0/0 returns NaN.
90/0 returns Inf.

Plain Python, on the other hand, raises a ZeroDivisionError in both cases.

Practical suggestion one avoids both cases, regardless of which flavor of zero-division handling the language uses.
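The Python behavior, and the effect of replacing a zero divisor with `sys.float_info.min`, can be checked with a short sketch. The helper `try_divide` is hypothetical and exists only to capture the exception.

```python
import sys

def try_divide(a, b):
    """Return a / b, or a short message if Python raises ZeroDivisionError."""
    try:
        return a / b
    except ZeroDivisionError as exc:
        return f"ZeroDivisionError: {exc}"

zero_by_zero = try_divide(0, 0)      # the 0/0 case
x_by_zero = try_divide(90, 0)        # the x/0 case with x != 0
print(zero_by_zero)
print(x_by_zero)

# With the tiny fallback divisor, neither case divides by zero
tiny = sys.float_info.min
score_at_median = 0.0 / tiny         # stays exactly 0
score_deviant = 90.0 / tiny          # becomes extremely large
print(score_at_median, score_deviant > 3.5)
```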
