Solved – The use of median polish for feature selection

Tags: feature selection, genetics, median

In a paper I was reading recently I came across the following bit in their data analysis section:

The data table was then split into tissues and cell lines, and the two subtables were separately median polished (the rows and columns were iteratively adjusted to have median 0) before being rejoined into a single table. We finally then selected for the subset of genes whose expression varied by at least 4-fold from the median in this sample set in at least three of the samples tested

I have to say I don't really follow the reasoning here. I was wondering if you could help me answer the following two questions:

Why is it desirable/helpful to adjust the median in the datasets? Why should it be done separately for different types of samples?
How is this not modifying the experimental data? Is this a known way of picking a number of genes/variables from a large set of data, or is it rather ad hoc?

Thanks,

Best Answer

Tukey's median polish algorithm is used in the RMA normalization of microarrays. As you may be aware, microarray data are quite noisy, so a robust way of estimating the probe intensities is needed, one that takes into account the observations for all the probes and all the arrays. The following model is typically used for normalizing probe intensities across arrays:

$$Y_{ij} = \mu_{i} + \alpha_{j} + \epsilon_{ij}$$ $$i=1,\ldots,I \qquad j=1,\ldots, J$$

where $Y_{ij}$ is the $\log$-transformed PM (perfect match) intensity for the $i^{th}$ probe on the $j^{th}$ array. The $\epsilon_{ij}$ are background noise and can be assumed to play the role of the error term in ordinary linear regression. However, a distributional assumption on $\epsilon$ may be restrictive, so we use Tukey's median polish to obtain the estimates $\hat{\mu}_i$ and $\hat{\alpha}_j$. This is a robust way of normalizing across arrays: we want to separate the signal, i.e. the intensity due to the probe, from the array effect $\alpha$. We obtain the signal by removing the estimated array effect $\hat{\alpha}_j$ from every array, which leaves only the probe effects plus some random noise.
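To make the fitting step concrete, here is a minimal Python sketch of median polish for the additive model above. It is an illustration only, not the RMA implementation: the function name `median_polish`, the stopping rule and the toy table are all assumptions made for this example. Rows play the role of probes $i$ and columns the role of arrays $j$, so $\hat{\mu}_i$ is the overall level plus the row effect and $\hat{\alpha}_j$ is the column effect.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-9):
    """Fit table[i][j] ~ overall + row[i] + col[j] by alternating median sweeps.

    Returns (overall, row_effects, col_effects, residuals).
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(r) for r in table]
    overall, row, col = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(max_iter):
        # Move each row's median from the residuals into the row effects.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        # Keep the column effects centred; their median goes to the overall level.
        shift = median(col)
        col = [c - shift for c in col]
        overall += shift
        # Move each column's median from the residuals into the column effects.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        # Keep the row effects centred as well.
        shift = median(row)
        row = [r - shift for r in row]
        overall += shift
        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row, col, resid


# Toy data (assumed): 4 probes x 3 arrays built exactly as mu_i + alpha_j,
# then one measurement is corrupted by a large outlier.
true_mu = [10, 12, 15, 11]
true_alpha = [-1, 0, 1]
Y = [[m + a for a in true_alpha] for m in true_mu]
Y[0][0] += 50

overall, row, col, resid = median_polish(Y)
mu_hat = [overall + r for r in row]  # estimated probe effects
alpha_hat = col                      # estimated array effects
print(mu_hat)       # [10.0, 12.0, 15.0, 11.0]
print(alpha_hat)    # [-1.0, 0.0, 1.0]
print(resid[0][0])  # 50.0
```

In this toy table the single corrupted cell ends up entirely in the residuals, and the probe and array estimates match the values used to build the table. The plain row mean of the first probe, by contrast, is pulled up to about 26.7.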

The paper linked earlier uses Tukey's median polish to find the differentially expressed or "interesting" genes by ranking them on the probe effect. However, the paper is pretty old, and at that time people were probably still figuring out how to analyze microarray data. Efron's paper on non-parametric empirical Bayes methods came out in 2001, but it was probably not yet widely used.

However, we now understand microarrays much better statistically and are fairly confident about how to analyze them.

Microarray data are pretty noisy, and RMA (which uses median polish) is one of the most popular normalization methods, perhaps because of its simplicity. Other popular, more sophisticated methods include GCRMA and VSN. Normalization is important because the interest is in the probe effect, not the array effect.

As you might expect, the analysis could have benefited from methods that borrow information across genes, such as Bayesian or empirical Bayes methods. Perhaps the paper you are reading is old and these techniques were not yet available.

Regarding your second point: yes, they are probably modifying the experimental data. But I think this modification is for a good cause and hence justifiable, for two reasons:

a) Microarray data are pretty noisy. When the interest is in the probe effect, normalizing the data by RMA, GCRMA, VSN, etc. is necessary, and taking advantage of any special structure in the data may be good. But I would avoid doing the second part, mainly because if we don't know the structure in advance, it is better not to impose a lot of assumptions.

b) Most microarray experiments are exploratory in nature; that is, the researchers are trying to narrow things down to a small set of "interesting" genes for further analysis or experiments. If these genes have a strong signal, modifications like normalization should not (substantially) affect the final results.

Therefore, the modifications may be justified. But I must remark that overdoing the normalization may lead to wrong results.
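The procedure quoted in the question can be sketched in a few lines of Python. This is a hedged reading of the quoted text, not the paper's code: it assumes the table holds log2 expression values, that "4-fold from the median" means an absolute polished value of at least log2(4) = 2, and the names `polish_residuals` and `select_genes` as well as the toy data are invented for illustration.

```python
import math
from statistics import median


def polish_residuals(table, n_iter=10):
    """Alternately subtract row and column medians; return the residual table."""
    resid = [list(row) for row in table]
    for _ in range(n_iter):
        for row in resid:
            m = median(row)
            for j in range(len(row)):
                row[j] -= m
        for j in range(len(resid[0])):
            m = median(row[j] for row in resid)
            for row in resid:
                row[j] -= m
    return resid


def select_genes(log2_table, groups, fold=4.0, min_samples=3):
    """Polish each column group separately, rejoin, and return the indices of
    rows that deviate by at least `fold` in at least `min_samples` columns."""
    n_rows, n_cols = len(log2_table), len(log2_table[0])
    joined = [[0.0] * n_cols for _ in range(n_rows)]
    for cols in groups:
        sub = [[row[j] for j in cols] for row in log2_table]
        res = polish_residuals(sub)
        for i in range(n_rows):
            for k, j in enumerate(cols):
                joined[i][j] = res[i][k]
    threshold = math.log2(fold)  # 4-fold on the raw scale = 2 on the log2 scale
    return [i for i, row in enumerate(joined)
            if sum(abs(v) >= threshold for v in row) >= min_samples]


# Toy log2 data (assumed): 5 genes x 8 samples.
# Columns 0-4 are "tissues", columns 5-7 are "cell lines".
table = [
    [8, 9, 8, 8, 8, 8, 8, 8],     # gene 0: flat (column 1 has a +1 array effect)
    [5, 6, 5, 5, 5, 5, 5, 5],     # gene 1: flat
    [6, 7, 6, 9, 3, 6, 6, 9],     # gene 2: large deviations in three samples
    [7, 8, 7, 7, 7, 10, 10, 10],  # gene 3: differs only between the two groups
    [4, 5, 4, 4, 5, 4, 4, 5],     # gene 4: small deviations
]
tissues, cell_lines = [0, 1, 2, 3, 4], [5, 6, 7]

separate = select_genes(table, [tissues, cell_lines])
joint = select_genes(table, [tissues + cell_lines])
print(separate)  # [2]
print(joint)     # [2, 3]
```

In this toy example gene 3 differs between tissues and cell lines but is flat within each group. Polishing the two groups separately removes that between-group offset, so gene 3 is not selected; polishing the whole table at once keeps the offset and selects it. Whether this is what the authors intended is a guess.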

Related Solutions

Solved – Is taking the median of a set of percentages statistically sound

What you are doing does not make sense if your goal is to characterize what proportion of the entire population (sample A + sample B + sample C) is in categories a, b, and c. Consider the following contingency table (counts on the left, within-sample proportions on the right):

Counts                  Within-sample proportions
     a    b    c             a     b     c
A    8    1    1        A   .80   .10   .10
B    7    2    1        B   .70   .20   .10
C    1   13   16        C   .03   .43   .53

Then, for example, the median of the category a proportions is 0.7 and the mean is 0.51, but only 16/50 = 0.32 of all the observations are in column a. Likewise, the median of the category c proportions is 0.1, but 18/50 = 0.36 of the observations are in column c. Does the "median summary" you propose tell you anything meaningful in a situation such as this one? Unless you have the marginal counts of either the samples or the categories, or you are willing to make some assumptions about them, I don't think there is a whole lot you can do in this case.
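The arithmetic in this example is easy to check. The short Python snippet below recomputes, for each category, the median and mean of the within-sample proportions and the pooled proportion over all 50 observations; the variable names are just for illustration.

```python
from statistics import mean, median

# Counts from the contingency table above: sample -> counts for categories a, b, c.
counts = {"A": [8, 1, 1], "B": [7, 2, 1], "C": [1, 13, 16]}
categories = ["a", "b", "c"]

# Within-sample proportions (each row divided by its own total).
props = {s: [c / sum(row) for c in row] for s, row in counts.items()}
total = sum(sum(row) for row in counts.values())  # 50 observations overall

summary = {}
for k, cat in enumerate(categories):
    per_sample = [props[s][k] for s in counts]
    pooled = sum(row[k] for row in counts.values()) / total
    summary[cat] = (median(per_sample), mean(per_sample), pooled)
    print(f"{cat}: median={summary[cat][0]:.2f} "
          f"mean={summary[cat][1]:.2f} pooled={pooled:.2f}")
```

For category a this gives a median of 0.70 and a mean of 0.51 against a pooled proportion of 0.32, because the largest sample (C) contributes almost nothing to that category.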

Do you have any specific goals in mind? Also, how many categories and samples do you have?

Edit: Your sample/population phrasing is slightly confusing. It's better to say you "have 3 samples, each of which can be subdivided into 3 categories a, b, and c." The phrase "sample population" is troublesome, as is your reference to two different "populations."

Solved – the advantage of median polish over the median

What you call (linearly) "borrowing strength" corresponds to what statisticians refer to as affine equivariance. In essence, you want an affine equivariant estimator of location that is also robust to outliers. The best-in-class estimators are the SDE [1] and the FastMCD [2].

Both have several implementations in R. In both cases, the best implementation is probably in the rrcov package, under the CovSde() and CovMcd() functions respectively.

library(MASS)         # mvrnorm()
library(rrcov)        # CovMcd(), CovSde()
library(matrixStats)  # colMedians()

# Clean data: 100 draws from a 5-dimensional normal with pairwise correlation 0.95
CM <- matrix(0.95, 5, 5)
diag(CM) <- 1
x <- mvrnorm(100, rep(0, 5), CM)
# the real data is correlated: you'd be better off borrowing
# strength from the adjacent columns.
z <- mvrnorm(10, rep(50, 5), diag(5))  # the outliers
w <- rbind(x, z)                       # contaminated data

# all three essentially similar:
CovMcd(w)@center
CovSde(w)@center
colMeans(x)

# Not the same b/c of outliers
colMeans(w)
# Not the same b/c does not use the correlation structure:
colMedians(w)

[1] R. A. Maronna and V. J. Yohai (1995). The behavior of the Stahel-Donoho robust multivariate estimator. Journal of the American Statistical Association 90 (429), 330–341.

[2] P. J. Rousseeuw and K. van Driessen (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212–223.
