Solved – Calculating robust z scores with median and MAD

descriptive statistics, mad, mathematical-statistics, r, z-score

Could someone explain the scaling factors involved in calculating robust z scores using median and MAD please?

As I understand it, conventional Z scores calculated using the mean and SD are sensitive to outliers in the data. An alternative is to use the median and median-absolute-deviation (MAD).

The formula for MAD is: MAD = median(| x – median(x)|)

However, in R, the MAD of a vector x of observations is median(abs(x – median(x))) multiplied by the default constant 1.4826 (scale factor for MAD for non-normal distribution), which is used to put MAD on the same scale as the data and assumes normally distributed data.

I'm confused as to how this fits in to computing robust z scores. I have seen this calculated as:

Robust z-score = (xi – x̃) / MAD (where xi: A single data value and x̃: The median of the dataset).

Also, I have seen:

Robust z-score = 0.6745(xi – x̃) / MAD

Which of these is correct? Does the MAD calculation above include the b constant 1.4826, or is the constant set to 1?

Furthermore, I've read that the standard b constant handles skewed data pretty well, but one could calculate b independently. I am dealing with slightly skewed data that follows a poisson distribution.

Any insight and suggestions would be greatly appreciated!

Best Answer

You write: "However, in R, the MAD of a vector x of observations is median(abs(x - median(x))) multiplied by the default constant 1.4826 (scale factor for MAD for non-normal distribution), which is used to put MAD on the same scale as the data and assumes normally distributed data."

But this is not quite right. The MAD multiplied by the factor 1.4826 is, as an estimator, consistent for the $\sigma$ of a ${\cal N}(0,\sigma^2)$ distribution. The factor is used to put the MAD on the same scale not "as the data", but as the standard estimator of the normal standard deviation. The normal distribution here is not an "assumption" but rather a calibration tool: the MAD is multiplied by a factor that, under normality, gives you (asymptotically at least) the same value as the standard estimator. This means that the size of the robust z-scores is comparable with the size of the standard z-scores, and quantiles of the normal distribution can be used, for example, for outlier detection.

This does not mean that the data have to be normal, as the MAD is not affected by outliers regardless of whether it is multiplied by 1.4826 or not. Rather, it means that if the majority of the data look as if they come from a normal distribution, robust z-scores can be used to detect outliers that are not in line with normality, because, unlike the standard z-scores, they are unaffected by these outliers. Multiplication by 1.4826 makes sure that the expected robust z-scores for non-outliers are in the same ballpark as the nonrobust z-scores would be if no outliers existed.
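To make the calibration concrete, here is a minimal R sketch using base R only. The simulated sample, its mean and standard deviation, and the seed are arbitrary choices for illustration and do not come from the question. The factor 1.4826 is, to four decimals, the reciprocal of the 0.75 quantile of the standard normal distribution, which is why the scaled MAD lands close to $\sigma$ when the data are normal.

```r
# The calibration constant: reciprocal of the 0.75 quantile of N(0, 1).
k <- 1 / qnorm(0.75)
round(k, 4)

set.seed(42)                       # arbitrary seed, for reproducibility only
x <- rnorm(1e5, mean = 3, sd = 2)  # hypothetical normal sample with sigma = 2

raw_mad    <- median(abs(x - median(x)))  # MAD with factor 1
scaled_mad <- k * raw_mad                 # MAD calibrated to estimate sigma

# For normal data the scaled MAD and the usual SD should be close to each
# other (and to 2 here), while the raw MAD is smaller by the factor k.
c(raw = raw_mad, scaled = scaled_mad, sd = sd(x))
```

R's `mad(x)` applies the same scaling by default through its `constant` argument, so `mad(x)` should agree with `scaled_mad` up to rounding of the constant.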

Oh, and in case this isn't clear anyway: 0.6745 ≈ 1/1.4826, so the formula involving 0.6745 just comes from multiplying the (unscaled) MAD in the denominator by 1.4826.
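As a quick check of this equivalence, the following base-R sketch computes the robust z-score both ways. The data vector is made up for illustration. The only library facts relied on are that `mad()` multiplies by `constant = 1.4826` by default and that `constant = 1` gives the unscaled MAD.

```r
x <- c(2, 4, 4, 5, 6, 7, 9, 50)   # hypothetical data with one obvious outlier

med <- median(x)

# Version 1: divide by the scaled MAD (R's default, constant = 1.4826).
z1 <- (x - med) / mad(x)

# Version 2: multiply by 0.6745 and divide by the unscaled MAD.
z2 <- 0.6745 * (x - med) / mad(x, constant = 1)

# The two agree up to the rounding of the constants (0.6745 vs 1/1.4826).
all.equal(z1, z2, tolerance = 1e-4)

round(cbind(x, z1, z2), 3)
```

The practical point is to apply the scaling exactly once: use either `(x - med) / mad(x)` or `0.6745 * (x - med) / mad(x, constant = 1)`, but do not combine the factor 0.6745 with the already scaled `mad(x)`.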

Warning! Very occasionally I have seen the term MAD used so that the factor 1.4826 is already included, and this can of course be a source of confusion. However, I believe that the majority use the notation as you did, defining the MAD without any factor (or, equivalently, with factor 1), which is then multiplied by 1.4826 outside the definition of the MAD.

Related Solutions

Solved – Median of medians as robust mean of means

If all the samples come from the same distribution, then yes the median of the sample medians is a fairly robust estimate of the median of the underlying distribution (though this need not be the same as the mean), since the median of a sample from a continuous distribution has probability 0.5 of being below (or above) the population median.

Added

Here is some illustrative R code. It takes a sample from a normal distribution, and a variant with outliers in which 1% of the data is 10,000 times bigger than it should be. It computes various statistics for the pooled data (50,000 points), and then the centre (mean or median) of the same statistics computed within each of the 10,000 samples of 5 points.

library(matrixStats)

# Statistics of the pooled data; n is the total number of points.
wholestats <- function(x,n) {
     mea <- sum(x)/n
     var <- sum((x-mea)^2)/(n-1)
     sdv <- sqrt(var)
     qun <- quantile(x, probs=c(0.25,0.5,0.75))
     mad <- median(abs(x-qun[2]))   # unscaled MAD (no 1.4826 factor)
     c(mean=mea, variance=var, st.dev=sdv, 
       median=qun[2], IQR=qun[3]-qun[1], 
       MAD=mad)
    }

# Statistics per row (one sample of size b per row), then summarised:
# mean of the moment-based statistics, median of the quantile-based ones.
rowstats <- function(x,b) {
     rmea <- rowSums(x)/b
     rvar <- rowSums((x-rmea)^2)/(b-1)
     rsdv <- sqrt(rvar)
     rqun <- rowQuantiles(x, probs=c(0.25,0.5,0.75))
     rmad <- rowMedians(abs(x-rqun[,2]))   # unscaled MAD of each row
     c(mean=mean(rmea), variance=mean(rvar), st.dev=mean(rsdv), 
       median=median(rqun[,2]), IQR=median(rqun[,3]-rqun[,1]), 
       MAD=median(rmad))
    }

a <- 10000 # number of samples
b <- 5     # samplesize

set.seed(1)
d <- array(rnorm(a*b), dim=c(a,b))
doutlier <- array(d * ifelse(runif(a*b)>0.99, 10000, 1) , dim=c(a,b))

As expected, the median-based statistics are more robust, though they fail to show that the heavy-tailed outlier variant is heavy-tailed.

> wholestats(d,a*b)
        mean     variance       st.dev   median.50%      IQR.75%          MAD 
-0.002440456  1.011306552  1.005637386 -0.001610677  1.357029247  0.678706371 
> wholestats(doutlier,a*b) 
         mean      variance        st.dev    median.50%       IQR.75%           MAD 
-3.425664e+00  9.591583e+05  9.793663e+02 -1.610677e-03  1.373658e+00  6.871415e-01 
> rowstats(d,b)
        mean     variance       st.dev       median          IQR          MAD 
-0.002440456  1.014611308  0.947630870  0.003460172  0.917642167  0.510115277 
> rowstats(doutlier,b) 
         mean      variance        st.dev        median           IQR           MAD 
-3.425664e+00  9.607212e+05  1.685929e+02  3.460172e-03  9.301795e-01  5.175084e-01

Solved – Median + MAD for skewed data

If the uncontaminated data in your sample are drawn from an asymmetric distribution, and the measure of scale you use to determine the width of the rejection region assumes that the good part of your data is symmetric, then these rejection regions will be larger than they need to be. For illustration, suppose the distribution of the data is really right-skewed. This would lead you to:

Reject genuine observations from the right tail as outliers.
Fail to detect outliers from the left tail for what they are.

Overall, the combined effect would be that your (inappropriately) cleaned dataset will look more symmetric than it really is.

The alternative here is to use an outlier detection rule that treats the left and right tails of your sample separately. Of course, compared to the MAD and median, this will also halve the breakdown point of your procedure (this is inevitable because the contamination rate of a half sample can potentially be twice as high as the contamination rate of the full sample).
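One simple way to treat the two tails separately is to compute one MAD from the points at or below the median and another from the points at or above it, an approach sometimes called a "double MAD" rule. The base-R sketch below illustrates that idea only. It is not the adjusted boxplot, and the function name `double_mad_z`, the example data, and the 3.5 cutoff are assumptions made for illustration.

```r
# Hypothetical helper: robust z-scores with a separate scale for each tail.
double_mad_z <- function(x, constant = 1.4826) {
  med <- median(x)
  dev <- x - med
  # Scaled MAD of the lower half and of the upper half, both taken
  # around the overall median.
  left_mad  <- constant * median(abs(dev[x <= med]))
  right_mad <- constant * median(abs(dev[x >= med]))
  # With heavily tied data (e.g. small-mean Poisson counts) a one-sided
  # MAD can be zero, in which case the scores are undefined.
  if (left_mad == 0 || right_mad == 0) {
    stop("a one-sided MAD is zero; too many ties around the median")
  }
  # Points below the median use the left scale, all others the right scale.
  dev / ifelse(dev < 0, left_mad, right_mad)
}

# Made-up right-skewed data with one large value in the right tail.
x <- c(0.1, 0.2, 0.3, 0.5, 0.7, 1.0, 1.4, 2.0, 3.0, 12.0)
z <- double_mad_z(x)
round(z, 2)

# Flag points whose score exceeds an (assumed) cutoff of 3.5 in absolute value.
x[abs(z) > 3.5]
```

Because each scale is estimated from roughly half of the data, this rule inherits the reduced breakdown point described above.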

In my opinion, the best procedure for this problem is to use the rejection regions from the adjusted boxplots. In my experience (drawn from numerical simulation), they can be expected to reliably detect asymmetric contamination even when the data contain as many as 10-15% outliers concentrated in one tail. Adjusted boxplots are widely implemented, and their connection with the classical boxplots makes them easy to understand and use. This answer explains and illustrates the use of adjusted boxplots in a context quite like yours.
