Solved – Median of medians as robust mean of means

mad, robust

The location and scale of normally distributed data can be estimated by drawing samples from the data and then taking the mean of the sample means and the mean of the sample standard deviations, respectively. For non-normal (heavy-tailed) data, is it correct to take the median of the sample medians and of the sample IQRs/MADs instead? That is, is the median of sample medians a valid robust estimator of location, in the same way that the mean of sample means is for normal data?

Best Answer

If all the samples come from the same distribution, then yes: the median of the sample medians is a fairly robust estimate of the median of the underlying distribution (though this need not be the same as its mean), since the median of a sample from a continuous distribution has probability 0.5 of being below (or above) the population median.
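As a quick check of that claim, here is a minimal R sketch. The Cauchy distribution, the seed, and the sizes (10,000 samples of 5 points) are assumptions chosen for illustration and are not part of the question. The population median is 0, so roughly half of the sample medians should fall below 0 and their median should sit close to 0, while the mean of sample means is not dependable for data like this.

```r
set.seed(1)                              # arbitrary seed (assumption)
a <- 10000                               # number of samples
b <- 5                                   # sample size (odd, so each median is an observed value)
x <- matrix(rcauchy(a * b), nrow = a)    # heavy-tailed data, population median 0

row_medians <- apply(x, 1, median)       # one median per sample (row)

prop_below        <- mean(row_medians < 0)   # share of sample medians below the population median
median_of_medians <- median(row_medians)     # robust location estimate
mean_of_means     <- mean(rowMeans(x))       # can land far from 0 for Cauchy data

print(c(prop_below = prop_below,
        median_of_medians = median_of_medians,
        mean_of_means = mean_of_means))
```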

Added

Here is some illustrative R code. It generates normally distributed data and a contaminated variant in which about 1% of the values are 10,000 times bigger than they should be. It then computes the various statistics in two ways: on the pooled data (50,000 points), and as the centre (mean or median) of the per-sample statistics across 10,000 samples of 5 points each.

library(matrixStats)

# statistics computed on the pooled data (all n values at once)
wholestats <- function(x,n) {
     mea <- sum(x)/n  
     var <- sum((x-mea)^2)/(n-1)
     sdv <- sqrt(var) 
     qun <- quantile(x, probs=c(0.25,0.5,0.75))
     mad <- median(abs(x-qun[2]))
     c(mean=mea, variance=var, st.dev=sdv, 
       median=qun[2], IQR=qun[3]-qun[1], 
       MAD=mad)
    }

# per-sample (row-wise) statistics, summarised across samples by their mean or median
rowstats <- function(x,b) {
     rmea <- rowSums(x)/b     
     rvar <- rowSums((x-rmea)^2)/(b-1)
     rsdv <- sqrt(rvar)
     rqun <- rowQuantiles(x, probs=c(0.25,0.5,0.75))  
     rmad <- rowMedians(abs(x-rqun[,2]))
     c(mean=mean(rmea), variance=mean(rvar), st.dev=mean(rsdv), 
       median=median(rqun[,2]), IQR=median(rqun[,3]-rqun[,1]), 
       MAD=median(rmad))
    }

a <- 10000 # number of samples
b <- 5     # samplesize

set.seed(1)
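# clean data: a samples (rows) of b standard normal values each
d <- array(rnorm(a*b), dim=c(a,b))
# contaminated copy: about 1% of the values multiplied by 10,000
doutlier <- array(d * ifelse(runif(a*b)>0.99, 10000, 1) , dim=c(a,b))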

The median-based statistics are, as expected, more robust, though they fail to show that the heavy-tailed outlier variant is heavy-tailed.

> wholestats(d,a*b)
        mean     variance       st.dev   median.50%      IQR.75%          MAD 
-0.002440456  1.011306552  1.005637386 -0.001610677  1.357029247  0.678706371 
> wholestats(doutlier,a*b) 
         mean      variance        st.dev    median.50%       IQR.75%           MAD 
-3.425664e+00  9.591583e+05  9.793663e+02 -1.610677e-03  1.373658e+00  6.871415e-01 
> rowstats(d,b)
        mean     variance       st.dev       median          IQR          MAD 
-0.002440456  1.014611308  0.947630870  0.003460172  0.917642167  0.510115277 
> rowstats(doutlier,b) 
         mean      variance        st.dev        median           IQR           MAD 
-3.425664e+00  9.607212e+05  1.685929e+02  3.460172e-03  9.301795e-01  5.175084e-01
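The pooled output does contain a hint of the heavy tail once the non-robust and robust scale estimates are compared with each other. For normal data the ratio SD/MAD is 1/qnorm(0.75), about 1.4826. The sketch below only redoes that arithmetic on the numbers printed by wholestats above; the variable names are made up for illustration.

```r
# values copied from the wholestats() output above
sd_clean  <- 1.005637386      # st.dev, normal data
mad_clean <- 0.678706371      # MAD, normal data
sd_out    <- 9.793663e+02     # st.dev, data with outliers
mad_out   <- 6.871415e-01     # MAD, data with outliers

normal_ratio <- 1 / qnorm(0.75)       # SD/MAD for an exactly normal distribution
ratio_clean  <- sd_clean / mad_clean  # close to normal_ratio
ratio_out    <- sd_out / mad_out      # far larger than normal_ratio

print(c(normal = normal_ratio, clean = ratio_clean, outlier = ratio_out))
```

A ratio far above the normal value suggests tails heavier than normal, which neither the SD nor the MAD reveals on its own.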

Related Solutions

Bayesian-Model – Developing a Robust Bayesian Model for Estimating Scale of Roughly Normal Distribution

Bayesian inference in a Student-t noise model with an appropriate prior gives a robust estimate of location and scale. The precise conditions that the likelihood and prior need to satisfy are given in the paper "Bayesian robustness modelling of location and scale parameters" by Andrade and O'Hagan (2011). The estimates are robust in the sense that a single observation cannot make them arbitrarily large, as demonstrated in Figure 2 of the paper.

When the data is normally distributed, the SD of the fitted T distribution (for fixed $\nu$) does not match the SD of the generating distribution. But this is easy to fix. Let $\sigma$ be the standard deviation of the generating distribution and let $s$ be the standard deviation of the fitted T distribution. If the data is scaled by 2, then from the form of the likelihood we know that $s$ must scale by 2. This implies that $s = \sigma f(\nu)$ for some fixed function $f$. This function can be computed numerically by simulation from a standard normal. Here is the code to do this:

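library(stats)
library(stats4)
# simulate from a standard normal, so sigma = 1
y = rnorm(100000, mean=0,sd=1)
nu = 4
# negative log-likelihood of a zero-centred T with scale parameter s and nu degrees of freedom
nLL = function(s) -sum(stats::dt(y/s,nu,log=TRUE)-log(s))
fit = mle(nLL, start=list(s=1), method="Brent", lower=0.5, upper=2)
# the variance of a standard T is nu/(nu-2)
# so the fitted scale times sqrt(nu/(nu-2)) is the SD of the fitted T, i.e. f(nu) since sigma = 1
print(coef(fit)*sqrt(nu/(nu-2)))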

For example, at $\nu=4$ I get $f(\nu)=1.18$. The desired estimator is then $\hat{\sigma} = s/f(\nu)$.
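Here is a sketch of how the correction could be applied to data. It uses optimize() rather than mle() for brevity, treats the location as known and equal to zero as in the code above, and the contaminated data set (true $\sigma = 3$, about 1% of values multiplied by 10,000) is invented for illustration. The names fitted_t_sd, f_nu and sigma_hat are hypothetical.

```r
set.seed(1)                      # arbitrary seed (assumption)
nu <- 4                          # fixed degrees of freedom, as above

# SD of the zero-centred T distribution (df = nu) fitted to y by maximum likelihood
fitted_t_sd <- function(y, nu) {
  nll <- function(s) -sum(dt(y / s, df = nu, log = TRUE) - log(s))
  # search interval built around a robust scale guess (an assumption of this sketch)
  s_hat <- optimize(nll, interval = c(0.1, 10) * mad(y))$minimum
  s_hat * sqrt(nu / (nu - 2))    # scale parameter -> SD of the fitted T
}

# f(nu): fitted SD when the generating distribution is standard normal
f_nu <- fitted_t_sd(rnorm(100000), nu)

# invented test data: sigma = 3, with about 1% of values multiplied by 10,000
y <- rnorm(10000, sd = 3)
y <- y * ifelse(runif(length(y)) > 0.99, 10000, 1)

sigma_hat <- fitted_t_sd(y, nu) / f_nu   # corrected robust estimate of sigma
print(c(f_nu = f_nu, plain_sd = sd(y), sigma_hat = sigma_hat))
```

With this kind of contamination the plain SD is dominated by the outliers, while sigma_hat should stay near the true value, though some upward bias from the outliers is to be expected.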

Solved – Median absolute deviation (MAD) and SD of different distributions

To address the question in comments:

"I would like to know if there is a possible range of values of the constant"

(I assume the question is intended to be about the median deviation from median.)

The ratio of SD to MAD can be made arbitrarily large.

Take some distribution with a given ratio of SD to MAD. Hold the middle $50\%+\epsilon$ of the distribution fixed (which means the MAD is unchanged) and move the tails out further. The SD increases, and by moving the tails far enough it can be pushed beyond any given finite bound.

The ratio of SD to MAD can also be made as near to $\sqrt{\frac{1}{2}}$ as desired by (for example) putting $25\%+\epsilon$ at each of $\pm 1$ and $50\%-2\epsilon$ at 0: the MAD is then 1 and the SD is $\sqrt{\frac{1}{2}+2\epsilon}$.

I think that would be as small as it goes.
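Both constructions can be checked numerically. The sketch below computes the population SD and MAD of a discrete distribution directly from its support and probabilities. The helper names (wmedian, sd_mad_ratio, tail_ratio) and the five-point distribution used for the heavy-tailed case are illustrative assumptions, not part of the original answer.

```r
# weighted median: smallest value whose cumulative probability reaches 0.5
wmedian <- function(x, p) {
  o <- order(x)
  x[o][which(cumsum(p[o]) >= 0.5)[1]]
}

# population SD / MAD for a discrete distribution with support x and probabilities p
sd_mad_ratio <- function(x, p) {
  m    <- sum(p * x)                      # mean
  sdv  <- sqrt(sum(p * (x - m)^2))        # standard deviation
  med  <- wmedian(x, p)                   # median
  madv <- wmedian(abs(x - med), p)        # median absolute deviation from the median
  sdv / madv
}

# lower end: 25% + eps at -1 and +1, 50% - 2*eps at 0
eps <- 1e-3
low_ratio <- sd_mad_ratio(c(-1, 0, 1), c(0.25 + eps, 0.5 - 2 * eps, 0.25 + eps))

# upper end: keep the middle fixed (MAD stays 1) and move the tails out to -t and +t
tail_ratio <- function(t) sd_mad_ratio(c(-t, -1, 0, 1, t), c(0.1, 0.2, 0.4, 0.2, 0.1))
big_ratios <- sapply(c(10, 1000, 1e6), tail_ratio)

print(c(low_ratio = low_ratio, sqrt_half = sqrt(0.5)))
print(big_ratios)
```

Shrinking eps moves low_ratio closer to $\sqrt{\frac{1}{2}}$, and increasing t makes the ratio grow without bound while the MAD stays at 1.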
