Solved – Calculating robust z scores with median and MAD

descriptive statistics, mad, mathematical-statistics, r, z-score

Could someone explain the scaling factors involved in calculating robust z scores using median and MAD please?

As I understand it, conventional Z scores calculated using the mean and SD are sensitive to outliers in the data. An alternative is to use the median and median-absolute-deviation (MAD).

The formula for MAD is: MAD = median(|x - median(x)|)

However, in R, the MAD of a vector x of observations is median(abs(x - median(x))) multiplied by the default constant 1.4826 (scale factor for MAD for non-normal distribution), which is used to put MAD on the same scale as the data and assumes normally distributed data.

I'm confused as to how this fits in to computing robust z scores. I have seen this calculated as:

Robust z-score = (xi - x̃) / MAD (where xi is a single data value and x̃ is the median of the dataset).

Also, I have seen:

Robust z-score = 0.6745 (xi - x̃) / MAD

Which of these is correct? Does the MAD calculation above include the b constant 1.4826, or is the constant set to 1?

Furthermore, I've read that the standard b constant handles skewed data pretty well, but one could calculate b independently. I am dealing with slightly skewed data that follows a Poisson distribution.

Any insight and suggestions would be greatly appreciated!

Best Answer

You write: "However, in R, the MAD of a vector x of observations is median(abs(x - median(x))) multiplied by the default constant 1.4826 (scale factor for MAD for non-normal distribution), which is used to put MAD on the same scale as the data and assumes normally distributed data."

But this is not quite right. The MAD multiplied by the factor 1.4826 is, as an estimator, consistent for the $\sigma$ of a ${\cal N}(0,\sigma^2)$ distribution. The factor puts the MAD on the same scale not "as the data" but as the standard estimator of the normal standard deviation. The normal distribution here is not an "assumption" but a calibration tool: the MAD is multiplied by a factor that, under normality, gives (at least asymptotically) the same value as the standard estimator. As a result, robust z-scores are comparable in size to standard z-scores, and quantiles of the normal distribution can be used, for example, for outlier detection.

This does not mean that the data have to be normal; the MAD is not affected by outliers whether or not it is multiplied by 1.4826. Rather, it means that if the majority of the data look as if they come from a normal distribution, robust z-scores can be used to detect outliers that are not in line with normality, because, unlike standard z-scores, they are unaffected by these outliers. Multiplying by 1.4826 ensures that the expected robust z-scores for non-outliers are in the same ballpark as the nonrobust z-scores would be if no outliers existed.
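To make the calibration argument concrete, here is a minimal R sketch on simulated data. The seed, sample size, true $\sigma = 2$ and the outlier value 1000 are arbitrary choices for illustration, not anything from the question. On clean normal data the scaled MAD should land close to the true standard deviation, and after contamination the robust z-scores should flag the outliers much more strongly than the standard ones.

```r
# Illustrative simulation only; all numbers below are made-up assumptions.
set.seed(1)
x <- rnorm(10000, mean = 5, sd = 2)   # clean normal data, true sigma = 2

raw_mad    <- mad(x, constant = 1)    # median(abs(x - median(x))), no factor
scaled_mad <- mad(x)                  # R's default, constant = 1.4826

# Under normality, scaled_mad and sd(x) estimate the same quantity (sigma)
c(raw_mad = raw_mad, scaled_mad = scaled_mad, sd = sd(x))

# Contaminate the sample with 50 gross outliers
x_out <- c(x, rep(1000, 50))

# Robust z-scores: median and scaled MAD barely move under contamination
robust_z <- (x_out - median(x_out)) / mad(x_out)

# Standard z-scores: mean and sd are both inflated by the outliers
standard_z <- (x_out - mean(x_out)) / sd(x_out)

# Compare how extreme the outliers look under each scoring
c(max_robust = max(robust_z), max_standard = max(standard_z))
```

Because both scores are on the normal scale, the same cutoff (for instance, a normal quantile) could be applied to either; the difference is only in how much the outliers distort the centre and scale estimates.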

Oh, and in case this isn't clear anyway: 0.6745 ≈ 1/1.4826, so the formula involving 0.6745 is just the first formula with the MAD in the denominator multiplied by 1.4826.
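The equivalence can be checked numerically. The sketch below uses a small made-up vector (an assumption for illustration) and computes the unscaled MAD by hand, so that neither formula depends on any software default. It also shows where the constant comes from: 1.4826 is, to four decimals, the reciprocal of the 0.75 quantile of the standard normal distribution.

```r
# Made-up values with one obvious outlier (illustration only)
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 40)

# Unscaled MAD, computed directly from the definition
raw_mad <- median(abs(x - median(x)))

# Formula with 0.6745 in the numerator and the unscaled MAD
z_a <- 0.6745 * (x - median(x)) / raw_mad

# Formula with the MAD in the denominator multiplied by 1.4826
z_b <- (x - median(x)) / (1.4826 * raw_mad)

# The two agree up to the rounding of the constants
all.equal(z_a, z_b, tolerance = 1e-4)

# Origin of the constant: reciprocal of the normal 0.75 quantile
1 / qnorm(0.75)
```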

Warning! Very occasionally I have seen the term MAD used so that the factor 1.4826 is already included, and this can of course be a source of confusion. However, I believe that most authors use the notation as you did, defining the MAD without any factor (or, equivalently, with factor 1) and multiplying by 1.4826 afterwards, outside the definition of the MAD.
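In R, one way to avoid this ambiguity is to pass the constant explicitly rather than rely on the default. The following is only a sketch: `robust_z` is a hypothetical helper written for this example, not a base R function, and the data vector is made up.

```r
# Hypothetical helper (not part of base R): robust z-scores with the
# scale constant stated explicitly instead of left to the default.
robust_z <- function(x, constant = 1.4826) {
  (x - median(x)) / mad(x, constant = constant)
}

x <- c(2, 3, 3, 4, 5, 5, 6, 7, 40)   # made-up values for illustration

mad(x, constant = 1)   # MAD as defined in the question, without a factor
mad(x)                 # R's default, which already includes 1.4826

# The ratio recovers the constant, which shows which convention is in use
mad(x) / mad(x, constant = 1)

robust_z(x)                 # scores on the normal-calibrated scale
robust_z(x, constant = 1)   # scores in units of the unscaled MAD
```

Checking the ratio against 1.4826 is a quick way to find out whether a given MAD implementation has the factor built in.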