Solved – How to estimate the scale factor for MAD for a non-normal distribution

anomaly detection, descriptive statistics, machine learning, mathematical-statistics, outliers

I understand that the scale factor for normally distributed data is 1.4826, which converts the MAD to a pseudo-standard-deviation that can be used with the median to determine confidence levels, just as the standard deviation is used with the mean. As I understand it, the value 1.4826 is the reciprocal of the 0.75 quantile of the standard normal distribution (the logic for 0.75 being that the interval between the 0.25 and 0.75 quantiles covers 50% of the standard normal distribution).

After this brief introduction, I want to ask: what if the data are not normally distributed, or the distribution is not known in advance? How should the scaling factor be chosen then? I referred to a paper by Leys et al. [2013], which mentions that the scaling factor should be the reciprocal of the 80th percentile, but is that after standardising the data? Also, what is the logic behind the 80th percentile? Is it an empirical result?

Thanks in advance for your patience in reading the question and for your valuable insights.

Best Answer

Definition: In R, the MAD of a vector x of observations is median(abs(x - median(x))) multiplied by the default constant you mention in your Question.

set.seed (726); x = round(rnorm(10, 100, 15))  # rounded-normal data
x
[1]  95  80 108  84 115  76  82  93 121 117

mad(x)
[1] 20.7564                  # default MAD in R
mad(x, const=1)
[1] 14                       # MAD with constant set to 1
median(abs(x-median(x)))
[1] 14                       # MAD using definition

The rationale for the constant $c = 1.4826$ is to put MAD on the same 'scale' as the sample standard deviation $S$ for large normal samples, in the sense that $E(S) \approx \sigma$ and $E(cD) \approx \sigma,$ where $S$ is the sample SD, $D$ is my notation for the sample MAD (without constant), and $\sigma$ is the SD of the normal population from which a large sample has been taken.
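The value 1.4826 can be checked directly from the standard normal quantile function. The lines below are a minimal sketch; the variable names `z75`, `const_normal`, and `central_mass` are introduced here for illustration only.

```r
# 0.75 quantile of the standard normal distribution (about 0.6745)
z75 <- qnorm(0.75)

# Reciprocal of that quantile: the normal-consistency constant (about 1.4826)
const_normal <- 1 / z75

# Probability that a standard normal value lies within +/- z75 of its median
central_mass <- pnorm(z75) - pnorm(-z75)

print(c(z75 = z75, const_normal = const_normal, central_mass = central_mass))
```

For a normal population the MAD equals $\sigma$ times the 0.75 quantile, because half of the probability lies within that distance of the median, so dividing by that quantile recovers $\sigma$.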

Illustrating with $n = 1000$ observations from $\mathsf{Norm}(\mu=100,\sigma=15):$

set.seed(725);  y = rnorm(1000, 100, 15)
sd(y);  mad(y)
[1] 14.64436         # sample SD, aprx pop SD 15
[1] 14.54209         # sample MAD, aprx same as sample and pop SD
mad(y, const=1)
[1] 9.808504         # MAD without constant multiple
sd(y)/mad(y, const=1)
[1] 1.493026         # Shows ratio aprx 1.4826

So with one normal sample of size $n =1000$ we have seen that a constant multiple of roughly 1.5 converts sample MAD $D$ to about the same scale as sample SD $S.$ We could get a more precise value for the constant by looking at many large normal samples.

[Note: If you use the same seed as shown, you will get exactly the same example. If you choose a different seed, or use no set.seed statement, you will get a fresh example.]
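The suggestion of looking at many large normal samples might be sketched as follows. The seed, the number of replications `m`, and the name `ratio` are arbitrary choices made here for illustration; more replications would narrow the simulation error.

```r
set.seed(2024)            # arbitrary seed, used only for reproducibility
m <- 2000                 # number of simulated samples (assumed, kept small for speed)
n <- 1000                 # size of each sample

# For each normal sample, compute the ratio of sample SD to unscaled sample MAD
ratio <- replicate(m, {
  y <- rnorm(n, 100, 15)
  sd(y) / mad(y, constant = 1)
})

mean(ratio)                  # average ratio, to compare with 1.4826
2 * sd(ratio) / sqrt(m)      # approximate 95% margin of simulation error
```

The average ratio should land close to 1.4826, with any small remaining gap attributable to simulation error and the finite sample size.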

Uniform Data: If $U \sim \mathsf{Unif}(100-15\sqrt{3}, 100+15\sqrt{3}),$ then $E(U) = 100, SD(U) = 15.$

set.seed(1234);  u = runif(1000, 100-15*sqrt(3), 100+15*sqrt(3))
sd(u);  mad(u, const=1);  mad(u)
[1] 15.13162   # S
[1] 13.01111   # D
[1] 19.29028   # MAD with NORMAL constant
sd(u)/mad(u, const=1)
[1] 1.162977   # suggests UNIFORM const is aprx 1.16

So from one large uniform sample, we see that the constant for uniform data may be about $c = 1.16.$ Intuitively, it seems the same constant ought to work for all uniform populations. Here is a simulation using 100,000 samples of size $n = 1000$ from the standard uniform distribution $\mathsf{Unif}(0,1).$ It shows that the constant for uniform data is about $c = 1.16.$ The 95% margin of simulation error for samples of size $n = 1000$ is about $\pm 0.0002.$ Larger samples might give a slightly different value.

m = 10^5;  n = 1000;  c = numeric(m)
for(i in 1:m) {
   u = runif(n);  s = sd(u);  d = mad(u, const=1)   # const=1: MAD without constant
   c[i] = s/d }
mean(c)
[1] 1.157575
2*sd(c)/sqrt(m)
[1] 0.0001560186
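The simulated value can be cross-checked against population quantities: the constant is the population SD divided by the population MAD, where the population MAD $d$ satisfies $P(|X - m| \le d) = 0.5$ and $m$ is the median. The following is a minimal sketch under that definition. The helper `mad_constant` and its arguments are hypothetical names introduced here, not part of base R, and the search bound `upper` has to be supplied by the user.

```r
# Hypothetical helper: population SD divided by population MAD.
#   pfun   - CDF of the distribution
#   qfun   - quantile function of the distribution
#   pop_sd - population standard deviation
#   upper  - upper bound for the root search (must exceed the population MAD)
mad_constant <- function(pfun, qfun, pop_sd, upper) {
  med <- qfun(0.5)
  # The population MAD d solves P(med - d <= X <= med + d) = 0.5
  g <- function(d) pfun(med + d) - pfun(med - d) - 0.5
  d <- uniroot(g, lower = 0, upper = upper, tol = 1e-12)$root
  pop_sd / d
}

c_unif <- mad_constant(punif, qunif, pop_sd = 1 / sqrt(12), upper = 0.5)
c_norm <- mad_constant(pnorm, qnorm, pop_sd = 1, upper = 10)
print(c(uniform = c_unif, normal = c_norm))
```

For $\mathsf{Unif}(0,1)$ the population SD is $1/\sqrt{12}$ and the population MAD is $1/4$, so the ratio is $2/\sqrt{3} \approx 1.1547.$ That is slightly below the simulated 1.1576, which is consistent with the remark that larger samples might give a slightly different value.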

Exponential Data: An analogous simulation for exponential data gives $c \approx 2.08.$
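A sketch of what such an analogous simulation might look like is given below. The seed, the replication count, and the names `ratio_exp` and `c_exp_exact` are assumptions made here for illustration. The closed-form value uses the fact that, for rate 1, the population MAD $d$ solves $\sinh(d) = 1/2.$

```r
set.seed(101)             # arbitrary seed, used only for reproducibility
m <- 2000                 # number of simulated samples (assumed, kept small for speed)
n <- 1000                 # size of each sample

# Ratio of sample SD to unscaled sample MAD for exponential samples.
# Rate 1 is used; SD and MAD scale together, so the ratio does not depend on the rate.
ratio_exp <- replicate(m, {
  x <- rexp(n)
  sd(x) / mad(x, constant = 1)
})

mean(ratio_exp)                   # simulated constant
2 * sd(ratio_exp) / sqrt(m)       # approximate 95% margin of simulation error

# Population value for comparison: SD = 1 and MAD = asinh(1/2) when the rate is 1
c_exp_exact <- 1 / asinh(0.5)
c_exp_exact
```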

Laplace Data: For random samples from a Laplace distribution the sample MAD is preferred to the sample SD as an estimate of the Laplace scale parameter. For Laplace data my simulation showed that $c \approx 2.04.$
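Base R has no Laplace generator, so the sketch below draws Laplace values as the difference of two independent standard exponential values, which has a Laplace distribution with location 0 and scale 1. The seed, the replication count, and the names `ratio_lap` and `c_lap_exact` are assumptions made here for illustration.

```r
set.seed(202)             # arbitrary seed, used only for reproducibility
m <- 2000                 # number of simulated samples (assumed, kept small for speed)
n <- 1000                 # size of each sample

# Ratio of sample SD to unscaled sample MAD for Laplace(0, 1) samples
ratio_lap <- replicate(m, {
  x <- rexp(n) - rexp(n)  # difference of two iid Exp(1) values is Laplace(0, 1)
  sd(x) / mad(x, constant = 1)
})

mean(ratio_lap)                   # simulated constant
2 * sd(ratio_lap) / sqrt(m)       # approximate 95% margin of simulation error

# Population value for comparison: SD = sqrt(2) and MAD = log(2) when the scale is 1
c_lap_exact <- sqrt(2) / log(2)
c_lap_exact
```

Since the population MAD of a Laplace distribution with scale $b$ is $b \ln 2,$ dividing the unscaled sample MAD by $\ln 2$ gives an estimate of $b$ itself rather than of the SD.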