I'm a programmer with a small statistics background and I need to find outliers in a small list of integers and floats.
After some searching on Google I found the Iglewicz and Hoaglin outlier test, which computes a modified z-score $M_i$ for every value in the list and checks it against a threshold (normally 3.5).
$$M_{i} = \frac{0.6745(x_{i} - \tilde{x})}{\mbox{MAD}}$$
I wrote a little Python script to test it. At first it worked great, but after a few tests I spotted an error.
If you try to find outliers (with my script) in a list with many identical values and one outlier, e.g. data = [10, 10, 10, 10, 10, 10, 10, 100], the MAD (median absolute deviation) becomes 0, and this leads me to my question: "What should I do if the MAD becomes 0?".
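A minimal sketch of the computation, using only the standard library `statistics` module (an illustration, not the original script), shows why this happens: more than half of the values equal the median, so more than half of the absolute deviations are 0, and the median of those deviations is 0 as well.

```python
from statistics import median

data = [10, 10, 10, 10, 10, 10, 10, 100]

med = median(data)                       # median of the raw values
abs_dev = [abs(x - med) for x in data]   # seven zeros and a single 90
mad = median(abs_dev)                    # median absolute deviation

print(med, mad)  # the MAD is 0, so dividing by it fails
```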
My first idea was to set the MAD to ∞, but this causes the script to find no outliers.
My second idea was to add very small offsets to the values to make them unique, e.g. data = [10.0, 10.00000001, 10.00000002, 10.00000003, 10.00000004, 10.00000004, 10.00000005, 100]. This way the MAD can't become 0 and my script is able to detect the outlier 100.
Does somebody have better ideas?
Am I doing something wrong?
Best Answer
1. A practical suggestion.
Change this part of the code
to
It does the trick. Division by this number blows up the Iglewicz-Hoaglin test statistic, which is exactly what is desired: strongly deviant observations are marked as outliers.
2. Previous practical suggestion.
What you could do is check whether it works with the closely related mean absolute error (MAE), defined here as the mean absolute deviation from the median:
$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^n |x_i - \text{median}(x)|, $$
Here $e_i = x_i - \text{median}(x)$ denotes the errors (better: residuals, or deviations), so the MAE is the mean of the $|e_i|$.
IBM uses this variant:
$$ M_{i} = \frac{x_{i} - \text{median}(x)} { 1.253314 \cdot \text{MAE} } $$
for the `if MAD == 0` case.
3. What is going on here? (From a programming perspective)
Consider the two cases:
Scientific programming languages R, Matlab and Julia have the following behavior:
- `0/0` returns `NaN`.
- `90/0` returns `Inf`.
Python, on the other hand, throws a `ZeroDivisionError` in both cases.
Practical suggestion one circumvents both cases for both flavors of zero-division handling.
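The fallback from suggestion 2 could look roughly like the sketch below. It is an illustration only: the function name `modified_z_scores` is made up, the constants 0.6745 and 1.253314 and the threshold 3.5 are taken from the formulas and the question above, and returning all zeros when every value is identical is an assumption of this sketch. Checking for a zero MAD before dividing also means Python never reaches the `ZeroDivisionError`.

```python
from statistics import mean, median

def modified_z_scores(values):
    """Modified z-scores, with the MAE-based variant as fallback when MAD == 0."""
    med = median(values)
    deviations = [x - med for x in values]
    abs_dev = [abs(d) for d in deviations]

    mad = median(abs_dev)
    if mad != 0:
        # Usual Iglewicz-Hoaglin statistic.
        return [0.6745 * d / mad for d in deviations]

    mae = mean(abs_dev)
    if mae == 0:
        # Assumption of this sketch: all values are identical, so nothing deviates.
        return [0.0 for _ in deviations]

    # Fallback variant for the MAD == 0 case.
    return [d / (1.253314 * mae) for d in deviations]

data = [10, 10, 10, 10, 10, 10, 10, 100]
scores = modified_z_scores(data)
outliers = [x for x, m in zip(data, scores) if abs(m) > 3.5]
print(outliers)  # prints [100]
```

For this data the MAE is 90/8 = 11.25, so the score of the value 100 is 90 / (1.253314 · 11.25) ≈ 6.38, which is above the threshold, while all other scores are 0.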