Solved – the advantage of median polish over the median

bioinformatics, median

I still can't figure out the advantage of median polish over a plain median for summarising probes into probeset values in microarray analysis.

As I understand it from here, the idea is to summarise the data using the median across the different arrays together with the median across the different probes: The use of median polish for feature selection
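
To make that concrete, base R's medpolish() (in the stats package) fits this kind of additive model: it alternately sweeps out row and column medians, leaving an overall effect, one effect per row (chip), one effect per column (probe), and residuals. A minimal sketch on a made-up matrix (the seed and dimensions below are purely illustrative):

set.seed(1)
m <- matrix(10 + rnorm(20), nrow = 4, ncol = 5)    # toy data: 4 chips x 5 probes
fit <- medpolish(m, trace.iter = FALSE)
fit$overall                      # grand effect
fit$row                          # one effect per chip (row)
fit$col                          # one effect per probe (column)
# the decomposition is exact: overall + row effect + column effect + residual
all.equal(m, fit$overall + outer(fit$row, fit$col, "+") + fit$residuals)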

I've been making some simulations to try to see what might happen.

If we imagine the following is a simulated microarray dataset, where the columns are the probes we wish to summarise and the rows are the different microarray chips:

> library(preprocessCore)
> y <- matrix(10 + rnorm(100), 20, 5)
> y
           [,1]      [,2]      [,3]      [,4]      [,5]
 [1,]  9.334358  9.993648  8.551274 10.109988 10.243317
 [2,] 11.448786  8.908376 11.536720 10.679236 10.831209
 [3,] 11.979813  8.726539  9.740086 10.103683  9.349783
 [4,] 10.552108 12.772855 10.484486  9.362849  9.426693
 [5,] 10.056581  9.890734  9.907472 10.283063  9.909602
 [6,] 12.187766 10.290644  8.770036 11.241425 12.856710
 [7,] 11.071675  9.932000 11.761954 10.806470 11.013961
 [8,]  9.560737  9.234133 11.307681  8.672639  9.570637
 [9,]  8.952978  8.549438  9.962865  8.527808  8.421271
[10,]  8.853584 10.117102 10.040929  9.551693  9.880730
[11,]  8.794862 11.276158  7.579099  9.167762  8.863161
[12,] 10.923659  9.873338  9.081718  9.501927 11.956930
[13,] 10.150289  8.472951  9.367948  9.376648  9.963847
[14,] 10.271810 10.738851 11.253192  8.169423 10.286973
[15,] 10.424398 10.356835 10.004031 10.790331  9.922300
[16,] 10.494939  7.793422 10.182820  9.597307  8.726760
[17,] 11.447273 10.366120 11.620400  9.698011 10.706059
[18,]  9.597951 11.161659  9.923795  8.462029 10.945888
[19,]  8.796656 10.962160 10.204844  7.944251  9.332819
[20,]  9.462135 10.621933  9.697430 11.112817  9.905340

So we can summarise the columns using the median (which treats each probe, that is each column, separately), and also using median polish (which takes both the columns and the rows into account; more detail in http://www.stats.ox.ac.uk/pub/MASS4/VR4stat.pdf):

> colSummarizeMedian(y)
$Estimates
[1] 10.211050 10.055375  9.983448  9.574500  9.915951

$StdErrors
[1] NA NA NA NA NA

> colSummarizeMedianpolish(y)
$Estimates
[1] 9.975078 9.898037 9.907471 9.788084 9.907471

$StdErrors
[1] NA NA NA NA NA
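
As a rough sanity check (and assuming colSummarizeMedianpolish fits the usual additive median-polish model and reports the overall effect plus the column effects), base R's medpolish() should give similar column summaries, though the two implementations may differ slightly in their convergence details:

fit <- medpolish(y, trace.iter = FALSE)
fit$overall + fit$col    # should be close to colSummarizeMedianpolish(y)$Estimates
apply(y, 2, median)      # should match colSummarizeMedian(y)$Estimates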

Then I add an "outlier" by changing one of the arrays and recalculate (code and output below). As we might expect, the plain medians are only partly affected (some columns don't change at all), but every median polish estimate changes.

However, I don't see why the median polish gives a better answer here than the plain median. I don't see why we are borrowing information from the other columns, or what we gain from doing so. This seems especially problematic since, most of the time, some of the chips will come from different treatments.

Perhaps I have missed something rather fundamental here, but any help would be much appreciated; I am at somewhat of a mental block.

Also, if people think this is better suited to a more bioinformatics q+a site, please let me know!

> y[20,] <- y[20,] + rnorm(5, sd=5)
> colSummarizeMedian(y)
$Estimates
[1] 10.348104  9.990152  9.983448  9.526810  9.915951

$StdErrors
[1] NA NA NA NA NA

> colSummarizeMedianpolish(y)
$Estimates
[1] 10.030423  9.879213  9.933043  9.713632  9.893364

$StdErrors
[1] NA NA NA NA NA
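
For what it's worth, refitting the additive model with base R's medpolish() on the perturbed matrix suggests that much of the disturbance to chip 20 is absorbed by that row's effect and residuals, which would be why all of the column estimates move a little rather than any single one moving a lot (again assuming the preprocessCore fit is essentially the same model):

fit2 <- medpolish(y, trace.iter = FALSE)   # y now includes the perturbed row 20
fit2$row[20]                               # the shared part of the perturbation lands in the row effect
fit2$residuals[20, ]                       # the rest should show up as comparatively large residuals
fit2$overall + fit2$col                    # the column summaries shift only slightly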

Best Answer

What you call (linearly) "borrowing strength" corresponds to what statisticians refer to as affine equivariance. In essence, you want an affine equivariant estimator of location that is also robust to outliers. The best-in-class estimators are the SDE [1] and the FastMCD [2].

Both have several implementations in R. In both cases, the best implementation is probably in the rrcov package, via the CovSde() and CovMcd() functions respectively.

library(MASS)
library(rrcov)
library(matrixStats)
CM <- matrix(0.95, 5, 5)
diag(CM) <- 1
x <- mvrnorm(100, rep(0, 5), CM)
# the real data are correlated: you'd be better off borrowing
# strength from the adjacent columns.
z <- mvrnorm(10, rep(50, 5), diag(5))    # the outliers
w <- rbind(x, z)

# all three essentially similar:
CovMcd(w)@center
CovSde(w)@center
colMeans(x)


# not the same, because of the outliers:
colMeans(w)
# not the same, because it does not use the correlation structure:
colMedians(w)

[1] R. A. Maronna and V. J. Yohai (1995). The Behavior of the Stahel-Donoho Robust Multivariate Estimator. Journal of the American Statistical Association, 90(429), 330–341.

[2] P. J. Rousseeuw and K. van Driessen (1999). A Fast Algorithm for the Minimum Covariance Determinant Estimator. Technometrics, 41, 212–223.