Solved – Median of medians as robust mean of means

madrobust

The location and scale of a normally distributed data can be estimated by sampling the data then taking the mean of the sample means and standard deviations, respectively. For non-normal (heavy-tailed) data, is it correct to take the median of the sample medians and IQR/MAD, instead? That is, is it correct to use the median of sample medians as a robust estimator of location similar to the mean of sample means for normal data?

Best Answer

If all the samples come from the same distribution, then yes the median of the sample medians is a fairly robust estimate of the median of the underlying distribution (though this need not be the same as the mean), since the median of a sample from a continuous distribution has probability 0.5 of being below (or above) the population median.

Added

Here is some illustrative R code. It takes a sample from a normal distribution and a case with outliers where 1% of data is 10,000 times bigger than it should be. It looks at the various statistics for the overall sample data (50,000 points) and then by the centre (mean or median) of the statistics of the 10,000 samples with 5 points in each sample.

library(matrixStats)

wholestats <- function(x,n) {
     mea <- sum(x)/n  
     var <- sum((x-mea)^2)/(n-1)
     sdv <- sqrt(var) 
     qun <- quantile(x, probs=c(0.25,0.5,0.75))
     mad <- median(abs(x-qun[2]))
     c(mean=mea, variance=var, st.dev=sdv, 
       median=qun[2], IQR=qun[3]-qun[1], 
       MAD=mad)
    }

rowstats <- function(x,b) {
     rmea <- rowSums(x)/b     
     rvar <- rowSums((x-rmea)^2)/(b-1)
     rsdv <- sqrt(rvar)
     rqun <- rowQuantiles(x, probs=c(0.25,0.5,0.75))  
     rmad <- rowMedians(abs(x-rqun[,2]))
     c(mean=mean(rmea), variance=mean(rvar), st.dev=mean(rsdv), 
       median=median(rqun[,2]), IQR=median(rqun[,3]-rqun[,1]), 
       MAD=median(rmad))
    }

a <- 10000 # number of samples
b <- 5     # samplesize

set.seed(1)
d <- array(rnorm(a*b), dim=c(a,b))
doutlier <- array(d * ifelse(runif(a*b)>0.99, 10000, 1) , dim=c(a,b))

The median based statistics as expected are more robust, though they fail to show that the heavy tailed outlier variant is heavy tailed.

> wholestats(d,a*b)
        mean     variance       st.dev   median.50%      IQR.75%          MAD 
-0.002440456  1.011306552  1.005637386 -0.001610677  1.357029247  0.678706371 
> wholestats(doutlier,a*b) 
         mean      variance        st.dev    median.50%       IQR.75%           MAD 
-3.425664e+00  9.591583e+05  9.793663e+02 -1.610677e-03  1.373658e+00  6.871415e-01 
> rowstats(d,b)
        mean     variance       st.dev       median          IQR          MAD 
-0.002440456  1.014611308  0.947630870  0.003460172  0.917642167  0.510115277 
> rowstats(doutlier,b) 
         mean      variance        st.dev        median           IQR           MAD 
-3.425664e+00  9.607212e+05  1.685929e+02  3.460172e-03  9.301795e-01  5.175084e-01 
Related Question