Solved – The use of median polish for feature selection

feature selection, genetics, median

In a paper I was reading recently I came across the following bit in their data analysis section:

The data table was then split into tissues and cell lines, and the two subtables were separately median polished (the rows and columns were iteratively adjusted to have median 0) before being rejoined into a single table. We finally then selected for the subset of genes whose expression varied by at least 4-fold from the median in this sample set in at least three of the samples tested

I have to say I don't really follow the reasoning here. I was wondering if you could help me answer the following two questions:

  1. Why is it desirable/helpful to adjust the median in the datasets? Why should it be done separately for different types of samples?

  2. How is this not modifying the experimental data? Is this a known way of picking a number of genes/variables from a large set of data, or is it rather ad hoc?

Thanks,

Best Answer

The Tukey median polish algorithm is used in the RMA normalization of microarrays. As you may be aware, microarray data are quite noisy, so a robust way of estimating the probe intensities is needed, one that takes into account the observations for all probes and all arrays. A typical model for normalizing probe intensities across arrays is:

$$Y_{ij} = \mu_{i} + \alpha_{j} + \epsilon_{ij}$$ $$i=1,\ldots,I \qquad j=1,\ldots, J$$

where $Y_{ij}$ is the $\log$-transformed PM (perfect match) intensity for the $i^{th}$ probe on the $j^{th}$ array. The $\epsilon_{ij}$ are background noise and can be assumed to correspond to the noise in ordinary linear regression. However, a distributional assumption on $\epsilon$ may be restrictive, so we use Tukey median polish to obtain the estimates $\hat{\mu}_i$ and $\hat{\alpha}_j$. This is a robust way of normalizing across arrays, because we want to separate the signal (the intensity due to the probe) from the array effect $\alpha$. We obtain the signal by subtracting the estimated array effect $\hat{\alpha}_j$ from every array, which leaves only the probe effects plus some random noise.
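To make the fitting procedure concrete, here is a minimal sketch of median polish in Python with NumPy. It is not the RMA implementation; the function name `median_polish`, its parameters, and the toy matrix are assumptions made for illustration. In the notation above, `overall + row[i]` plays the role of the estimated probe effect and `col[j]` the role of the estimated array effect.

```python
import numpy as np

def median_polish(y, max_iter=10, tol=1e-6):
    """Fit y[i, j] = overall + row[i] + col[j] + resid[i, j] by sweeping medians.

    Hypothetical helper for illustration; rows are probes, columns are arrays.
    """
    resid = np.asarray(y, dtype=float).copy()
    overall = 0.0
    row = np.zeros(resid.shape[0])
    col = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        # Sweep the row medians out of the residuals into the row effects.
        r = np.median(resid, axis=1)
        resid -= r[:, None]
        row += r
        # Re-centre the column effects; their median goes into the overall term.
        d = np.median(col)
        col -= d
        overall += d
        # Sweep the column medians out of the residuals into the column effects.
        c = np.median(resid, axis=0)
        resid -= c[None, :]
        col += c
        # Re-centre the row effects in the same way.
        d = np.median(row)
        row -= d
        overall += d
        # Stop once a full sweep no longer changes anything noticeably.
        if np.abs(r).max() < tol and np.abs(c).max() < tol:
            break
    return overall, row, col, resid

# Toy log-intensity table (made-up numbers): 3 probes x 3 arrays, one outlier.
y = np.array([[1.0, 1.5, 0.5],
              [2.0, 2.5, 1.5],
              [3.0, 3.5, 12.0]])
overall, row, col, resid = median_polish(y)
print("probe effects:", overall + row)
print("array effects:", col)
print("residuals:\n", resid)
```

The residual table is what the paper quoted in the question appears to call the "median polished" data: after the sweeps converge, its rows and columns have median (approximately) 0. Because medians are used instead of means, a single outlying cell mostly ends up in its own residual rather than distorting the effects.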

The link I quoted earlier uses Tukey median polish to identify differentially expressed, or "interesting", genes by ranking them by probe effect. However, the paper is fairly old, and at that time people were probably still figuring out how to analyze microarray data. Efron's paper on non-parametric empirical Bayes methods came out in 2001, but it may not yet have been widely used.

However, we now understand microarrays much better statistically, and the methods for analyzing them are well established.

Microarray data are pretty noisy, and RMA (which uses median polish) is one of the most popular normalization methods, maybe because of its simplicity. Other popular and more sophisticated methods include GCRMA and VSN. Normalization is important because the interest is in the probe effect, not the array effect.

As you might expect, the analysis could have benefited from methods that borrow information across genes, such as Bayesian or empirical Bayes methods. Maybe the paper you are reading is old and these techniques were not yet available.

Regarding your second point: yes, they are probably modifying the experimental data. But I think this modification is for a good cause and hence justifiable, for the following reasons:

a) Microarray data are pretty noisy. When the interest is in the probe effect, normalizing the data with RMA, GCRMA, VSN, etc. is necessary, and taking advantage of any special structure in the data may be good. But I would avoid doing the second part, mainly because if we don't know the structure in advance, it is better not to impose a lot of assumptions.

b) Most microarray experiments are exploratory in nature; that is, the researchers are trying to narrow the data down to a small set of "interesting" genes for further analysis or experiments. If these genes have a strong signal, modifications such as normalization should not (substantially) affect the final results.

Therefore, the modifications may be justified. But I must remark that overdoing the normalization may lead to wrong results.
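For concreteness, the gene-selection rule quoted in the question (at least 4-fold from the median in at least three samples) could be sketched as below. This is only one reading of that sentence, not the paper's code: it assumes the table holds log2 expression values with genes in rows and samples in columns, and that "the median" means each gene's median across samples. The function name `select_variable_genes` and the toy data are hypothetical.

```python
import numpy as np

def select_variable_genes(log2_expr, fold=4.0, min_samples=3):
    """Boolean mask of genes deviating >= `fold` from their median in >= `min_samples` samples.

    Assumes log2-scale data (genes x samples); a 4-fold change is then 2 log2 units.
    """
    x = np.asarray(log2_expr, dtype=float)
    # Absolute deviation of every value from its gene's median across samples.
    dev = np.abs(x - np.median(x, axis=1, keepdims=True))
    # Count, per gene, the samples that deviate by at least the fold threshold.
    hits = (dev >= np.log2(fold)).sum(axis=1)
    return hits >= min_samples

# Made-up log2 values: gene 0 varies strongly in three samples, gene 1 does not.
expr = np.array([[0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.5, -3.0, 0.0],
                 [0.0, 0.5, -0.5, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]])
mask = select_variable_genes(expr)
print(mask)
```

After median polishing, each gene's median is already close to 0, so the subtraction inside the function changes little; it is kept so the sketch also works on unpolished data.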