Solved – modified z score

median, outliers, z-score

I am using the modified z-score to find outliers in time-series data on the exit rate for a website.
N = 1131. Based on the last 3 years of daily data (1096 values), I am looking for outliers among the remaining values.
The formula I used for the modified z-score is 0.6745 * (Yi - Ymedian) / MAD, where:
Yi = actual value
Ymedian = median of the entire dataset
MAD = Median(Abs(Values - Median(Values)))
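
A minimal sketch of the formula above, assuming NumPy is available. The function name `modified_z_scores` and the sample values are hypothetical and are not taken from the exit-rate data in the question.

```python
import numpy as np

def modified_z_scores(values):
    """Modified z-scores: 0.6745 * (Yi - Ymedian) / MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    # MAD = median of the absolute deviations from the median
    mad = np.median(np.abs(values - median))
    if mad == 0:
        raise ValueError("MAD is zero, so modified z-scores are undefined")
    return 0.6745 * (values - median) / mad

# Hypothetical exit rates (percent); the last value is deliberately extreme.
sample = [40.0, 42.0, 41.0, 39.0, 43.0, 60.0]
scores = modified_z_scores(sample)
flagged = np.abs(scores) > 3.5  # boolean mask of values beyond the 3.5 cutoff
```

The guard for a zero MAD matters in practice: if more than half of the values are identical, the MAD is 0 and the division is undefined.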

Iglewicz & Hoaglin suggest treating a value as a potential outlier when the absolute value of its modified z-score exceeds 3.5. When I apply that rule, it suggests my data has no outliers.
My question is: can we change 3.5 to 2.5 or 2? If yes, how do we determine what the cutoff should be?

Best Answer

Your dataset seems to be relatively small, so you may use the following:

import numpy as np

def outliers_modified_zscore(v, threshold=3.762):
    """Return the values in v whose absolute modified z-score exceeds threshold."""
    # ST DEV ≈ 1.155 MAD (uniform dist)
    # ST DEV ≈ 1.254 MAD for small samples (MAD < 2.5) and
    # 1.4826 MAD for large samples (MAD >= 2.5) (normal dist)
    # Hence the scale 0.7974 ≈ 1/1.254 below, instead of 0.6745 ≈ 1/1.4826.
    v = np.asarray(v, dtype=float)  # v = your array of values
    median = np.median(v)
    median_absolute_deviation = np.median(np.abs(v - median))
    if median_absolute_deviation == 0:
        raise ValueError("MAD is zero, so modified z-scores are undefined")
    modified_z_scores = 0.7974 * (v - median) / median_absolute_deviation
    return v[np.abs(modified_z_scores) > threshold].tolist()

You may also reduce the threshold, since sometimes the data contains no "extreme" outliers.
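
One way to see what lowering the threshold does is to count how many points each candidate cutoff would flag. The sketch below follows the setup in the question: the median and MAD come from a baseline, and the remaining values are scored against it. The function name `count_flagged`, the default scale of 0.6745, and the synthetic normal data are all assumptions for illustration, not the real exit-rate series.

```python
import numpy as np

def count_flagged(baseline, new_values, thresholds=(3.5, 3.0, 2.5, 2.0), scale=0.6745):
    """Count the new_values whose |modified z-score| exceeds each threshold."""
    baseline = np.asarray(baseline, dtype=float)
    new_values = np.asarray(new_values, dtype=float)
    # The median and MAD are estimated from the baseline only.
    median = np.median(baseline)
    mad = np.median(np.abs(baseline - median))
    if mad == 0:
        raise ValueError("MAD of the baseline is zero, so scores are undefined")
    scores = np.abs(scale * (new_values - median) / mad)
    return {t: int(np.sum(scores > t)) for t in thresholds}

# Synthetic stand-in: 1096 baseline days and 35 new days.
rng = np.random.default_rng(0)
baseline = rng.normal(loc=40.0, scale=2.0, size=1096)
new_values = rng.normal(loc=40.0, scale=2.0, size=35)
counts = count_flagged(baseline, new_values)  # dict mapping threshold -> count
```

A lower cutoff can only flag the same points or more, so these counts show how sensitive the result is to the cutoff; they do not say which cutoff is correct.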

More information on MAD and standard deviation:
https://blog.arkieva.com/relationship-between-mad-standard-deviation/
https://blog.arkieva.com/mad-versus-standard-deviation/