I understand that, for normally distributed data, the scale factor 1.4826 converts the MAD into a pseudo-standard-deviation that can be used with the median to determine confidence levels, just as the standard deviation is used with the mean. From what I have read, the value 1.4826 is the reciprocal of the 0.75 quantile of the standard normal distribution (the logic for 0.75 being that the interval between the 0.25 and 0.75 quantiles covers 50% of the standard normal distribution).
Now, after this brief introduction, I want to ask: what if the data are not normally distributed, or the distribution is not known in advance? How should the scaling-factor problem be approached then? I referred to a paper by Leys et al. [2013], which says the scaling factor should be 1/80th quantile, but is that after standardising the data? Also, what is the logic behind the 80th percentile? Is that an empirical result?
Thanks in advance for your patience in reading the questions and for your valuable insights.
Best Answer
Definition: In R, the MAD of a vector `x` of observations is `median(abs(x - median(x)))` multiplied by the default constant you mention in your Question.
The rationale for the constant $c = 1.4826$ is to put MAD on the same 'scale' as the sample standard deviation $S$ for large normal samples, in the sense that $E(S) \approx \sigma$ and $E(cD) \approx \sigma,$ where $S$ is the sample SD, $D$ is my notation for the sample MAD (without constant), and $\sigma$ is the SD of the normal population from which a large sample has been taken.
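For a normal population the constant can also be obtained without simulation: half of the probability lies within $\Phi^{-1}(0.75)\,\sigma \approx 0.6745\,\sigma$ of the median, so the population MAD is about $0.6745\,\sigma$ and $c$ is the reciprocal of that quantile. A minimal sketch in Python (standard library only, used here instead of R; the variable names are my own):

```python
from statistics import NormalDist

# 0.75 quantile of the standard normal: P(|Z| <= q75) = 0.5,
# so q75 is the population MAD of N(0, 1).
q75 = NormalDist().inv_cdf(0.75)

# Reciprocal of that quantile: the factor that rescales MAD to sigma.
c_normal = 1.0 / q75

print(f"q75 = {q75:.4f}, c = {c_normal:.4f}")
```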
Illustrating with $n = 1000$ observations from $\mathsf{Norm}(\mu=100,\sigma=15):$
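A minimal sketch of such an illustration, written in Python with only the standard library rather than R. The seed and the variable names are arbitrary choices for this sketch, so the printed values will not be those of the original R session; only the rough size of the ratio $S/D$ matters.

```python
import random
import statistics

rng = random.Random(2024)  # arbitrary seed, assumed for this sketch only

# One sample of n = 1000 observations from Norm(mu = 100, sigma = 15).
x = [rng.gauss(100, 15) for _ in range(1000)]

s = statistics.stdev(x)                           # sample SD, S
med = statistics.median(x)
d = statistics.median([abs(v - med) for v in x])  # sample MAD without constant, D

ratio = s / d  # should land near 1.5 for a large normal sample
print(f"S = {s:.3f}, D = {d:.3f}, S/D = {ratio:.3f}")
```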
So with one normal sample of size $n =1000$ we have seen that a constant multiple of roughly 1.5 converts sample MAD $D$ to about the same scale as sample SD $S.$ We could get a more precise value for the constant by looking at many large normal samples.
[Note: If you use the same seed as shown, you will get exactly the same example. If you choose a different seed, or use no `set.seed` statement, you will get a fresh example.]
Uniform Data: If $U \sim \mathsf{Unif}(100-15\sqrt{3}, 100+15\sqrt{3}),$ then $E(U) = 100, SD(U) = 15.$
So from one large uniform sample, we see that the constant for uniform data may be about $c = 1.16.$ Intuitively, it seems the same constant ought to work for all uniform populations. Here is a simulation using 100,000 samples of size $n = 1000$ from a standard uniform distribution $\mathsf{Unif}(0,1).$ It shows that the constant for uniform data is about $c = 1.16.$ The 95% margin of simulation error for samples of size $n = 1000$ is about $\pm 0.0002.$ Larger samples might give a slightly different value.
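A scaled-down sketch of this kind of simulation in Python (standard library only, not the original R code). It assumes the constant is estimated as the average of $S$ divided by the average of $D$ over the simulated samples, and it uses only 500 samples so that it runs quickly; the seed and names are arbitrary. For comparison, the population values for $\mathsf{Unif}(0,1)$ are $SD = 1/\sqrt{12}$ and MAD $= 1/4,$ which gives a large-sample limit of $2/\sqrt{3} \approx 1.155.$

```python
import random
import statistics

rng = random.Random(1234)  # arbitrary seed, assumed for this sketch only
n, reps = 1000, 500        # far fewer replications than the 100,000 in the text

s_vals, d_vals = [], []
for _ in range(reps):
    u = [rng.random() for _ in range(n)]  # one Unif(0, 1) sample
    med = statistics.median(u)
    s_vals.append(statistics.stdev(u))                             # S
    d_vals.append(statistics.median([abs(v - med) for v in u]))    # D

# Ratio of average S to average D: the simulated constant for uniform data.
c_unif = statistics.fmean(s_vals) / statistics.fmean(d_vals)
print(f"simulated c for uniform data: {c_unif:.3f}")
```

With so few replications the simulation error is much larger than the $\pm 0.0002$ quoted above, so only the first two digits should be trusted.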
Exponential Data: An analogous simulation for exponential data gives $c \approx 2.08.$
Laplace Data: For random samples from a Laplace distribution the sample MAD is preferred to the sample SD as an estimate of the Laplace scale parameter. For Laplace data my simulation showed that $c \approx 2.04.$
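These two simulated values can be cross-checked against population quantities, assuming the large-sample constant is the population SD divided by the population MAD. For an exponential distribution with rate 1, the SD is 1, the median is $\ln 2,$ and $P(|X - \ln 2| \le d) = \sinh(d),$ so the MAD is $\sinh^{-1}(1/2).$ For a Laplace distribution with scale 1, the SD is $\sqrt{2}$ and the MAD is $\ln 2.$ A short Python sketch (names are my own):

```python
import math

# Exponential(rate = 1): SD = 1, MAD solves sinh(d) = 1/2.
mad_exp = math.asinh(0.5)
c_exp = 1.0 / mad_exp

# Laplace(scale = 1): SD = sqrt(2), and |X - median| is Exponential(1),
# whose median is ln 2.
mad_laplace = math.log(2.0)
c_laplace = math.sqrt(2.0) / mad_laplace

print(f"exponential: c = {c_exp:.3f}")
print(f"Laplace:     c = {c_laplace:.3f}")
```

Both constants are scale-free, so the same values apply for any rate or scale parameter.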