Does normalization reduce (or remove) variance or bias?

microarray normalization

I'm currently analyzing microarray data.

Background on microarray normalization (not necessary to understand the question)

• Based on a global adjustment:
$\log_2 {\frac{R}{G}} \rightarrow \log_2{\frac{R}{G}}-c = \log_2{\frac{R}{kG}}$

• Choices for $k$ or $c = \log_2{k}$ are:
  – $c$ = median or mean of the log ratios for a particular gene set (all genes, control genes, or housekeeping genes).
  – Total intensity normalization, where $k= \frac{\sum_iR_i}{\sum_iG_i}$.
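
As a minimal sketch of the first choice, the snippet below subtracts the median log ratio from every spot. The function name `median_center_log_ratios` and the intensity values are made up for illustration, and it assumes all spots are used as the gene set.

```python
import math
import statistics

def median_center_log_ratios(red, green):
    """Return log2(R/G) values shifted so that their median is zero, plus the shift c."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    c = statistics.median(log_ratios)  # global offset estimated from all spots
    return [m - c for m in log_ratios], c

# Hypothetical intensities: the red channel is twice as bright overall,
# and only the last spot differs beyond that global factor.
red = [200.0, 400.0, 800.0, 1600.0, 6400.0]
green = [100.0, 200.0, 400.0, 800.0, 800.0]

normalized, c = median_center_log_ratios(red, green)
k = 2 ** c  # equivalent multiplicative factor applied to G

print("c =", c, "k =", k)
print("normalized log ratios:", normalized)
```

Subtracting $c$ on the log scale is the same as dividing each ratio by $k = 2^c$, which is why the two forms in the formula above are interchangeable.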

So all that total intensity normalization does is make the totals of both slides or channels the same, by multiplying the values of one channel or array by a normalization factor.

Normalization procedure
More detail is given above, but it boils down to this: we can assume that there is no overall change (this is a standard microarray assumption, so I won't explain it here). To compare two datasets (say A and B), we multiply every value in dataset A by tot(B) / tot(A), which gives both datasets the same total.
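
A small sketch of this rescaling, with a hypothetical helper `total_intensity_normalize` and made-up values for A and B:

```python
def total_intensity_normalize(a, b):
    """Scale dataset a so that its total equals the total of dataset b."""
    factor = sum(b) / sum(a)  # tot(B) / tot(A)
    return [x * factor for x in a], factor

# Hypothetical intensities for two arrays/channels
a = [10.0, 20.0, 30.0, 40.0]  # total 100
b = [30.0, 50.0, 60.0, 60.0]  # total 200

a_scaled, factor = total_intensity_normalize(a, b)

print("factor:", factor)
print("scaled A:", a_scaled)
print("totals:", sum(a_scaled), sum(b))
```

Note that every value in A is multiplied by the same factor, so the ratios between values within A are unchanged; only the overall level moves.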

Question
Will this normalization ("shifting" the totals to the same value) adjust for variance or for bias? In microarray research the biases and variances are:

  • Bias caused by the choice of technique used and the conditions tested,
    or by the experimental procedure
  • Variance caused by natural variability and measurement accuracy

Best Answer

The answer depends both on the parameter of interest and the experimental design.

Knowing where the R and G variables come from helps those not familiar with this technology. In 2-channel microarray experiments, you have 2 samples of nucleic acid (e.g., collected under condition A and under condition B), with one sample labeled with a red (R) fluorescent marker and the other with green (G). The samples are mixed and applied to an array of spots, with each individual spot binding a specific nucleic-acid sequence. The ratio of R to G fluorescence on each spot is then used to measure the relative amounts of that particular nucleic-acid sequence in the 2 samples. The relative strengths of the R and G fluorescence signals on the array, however, could arise from different starting amounts of samples from A and B, different success in labeling the samples, differences in the volumes of the 2 labeled samples applied to the microarray, or different intrinsic intensities of fluorescent emission from the R and G markers.

From Wikipedia, the bias of an estimator is:

the difference between this estimator's expected value and the true value of the parameter being estimated.

So bias will mean something different if your parameter of interest is the fluorescence intensity per se, the ratio of amounts of nucleic acids in the labeled samples applied to the microarray, or the true ratio of amounts of some particular nucleic acid sequences between conditions A and B. I assume that you are interested in the latter.

If you do one experiment with sample A labeled with G and sample B labeled with R, the "normalization" tries to correct for the various technical ways in which G and R fluorescence might differ beyond the effects of the 2 conditions. The assumption is that most nucleic acid sequences don't differ in amount between the 2 conditions, so that a general correction for overall R/G signal ratio will correct for these potential technical difficulties. Only particular nucleic acids whose fluorescence ratios are adequately higher or lower than that overall R/G ratio would be considered significantly different between conditions A and B. So in terms of distinguishing the 2 conditions this is a correction for bias. In this case, with only one experiment, it's not very fruitful to talk about variance.
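
One way to write this down, assuming a simple additive model on the log scale (the symbols $M_i$, $\theta_i$, $d$ and $\varepsilon_i$ are introduced here for illustration and are not part of the question):

$$M_i = \log_2{\frac{R_i}{G_i}} = \theta_i + d + \varepsilon_i$$

Here $\theta_i$ is the true log ratio for sequence $i$ between the two conditions, $d$ is an array-wide technical offset between the R and G channels, and $\varepsilon_i$ is measurement noise. If $\theta_i = 0$ for most sequences, then the median of the $M_i$ is approximately $d$, so taking $c$ as that median gives

$$M_i - c \approx \theta_i + \varepsilon_i$$

Under this model the normalization removes the shared offset $d$, which is a bias in the estimate of $\theta_i$, and leaves the noise term $\varepsilon_i$ untouched.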

Now let's examine different labeling strategies for performing 6 replicates of this experiment. First, say that all samples from condition A are labeled G and all from condition B are labeled R. In that case a per-array normalization minimizes bias for each array (again, in terms of the ratio between conditions) and thus also the bias of an estimator based on the means of results among the 6 arrays. It's not immediately clear to me how much this would affect the variance of the estimator; my guess is probably little if at all.

Now say that instead you do 3 replicates with the above labeling scheme and 3 replicates with the labels reversed between the 2 conditions. In that case, bias in the log(A/B) ratio of individual arrays arising from general differences between R and G would average out even if there were no normalization. In this design, with labels balanced between conditions, the expected value of the log ratio between conditions A and B (conditions, not colors), averaged over the 6 arrays, could well equal the true value even without normalization. In this case normalization does not affect bias; it does, however, greatly decrease variance among the 6 replicates.
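
The two 6-replicate designs can be compared with a toy simulation. This is only a sketch under strong assumptions that are not in the answer itself: a single gene with true log2(A/B) of `theta`, a dye offset `dye_offset` that is identical on every array, Gaussian noise, and a normalization that removes the offset exactly. All names and numbers are hypothetical.

```python
import random
import statistics

def simulate_design(dye_signs, theta=1.0, dye_offset=0.8, noise_sd=0.05, seed=0):
    """Simulate per-array log2(A/B) estimates for one gene.

    dye_signs: +1 if condition A is labeled R on that array, -1 if the labels are swapped.
    Returns (raw, normalized) lists of log2(A/B), one value per array.
    """
    rng = random.Random(seed)
    raw, normalized = [], []
    for sign in dye_signs:
        noise = rng.gauss(0.0, noise_sd)
        # Observed log2(R/G): true effect in R/G orientation + dye offset + noise
        m = sign * theta + dye_offset + noise
        # Convert back to the A/B orientation, without and with normalization
        raw.append(sign * m)
        normalized.append(sign * (m - dye_offset))  # idealized: offset known exactly
    return raw, normalized

raw_same, norm_same = simulate_design([+1] * 6)                   # A always labeled R
raw_swap, norm_swap = simulate_design([+1, +1, +1, -1, -1, -1])   # 3 + 3 dye swap

for name, values in [("same label, raw", raw_same), ("same label, normalized", norm_same),
                     ("dye swap, raw", raw_swap), ("dye swap, normalized", norm_swap)]:
    print(f"{name:24s} mean={statistics.mean(values):.3f} "
          f"variance={statistics.variance(values):.4f}")
```

With the same labels on every array, the raw mean is shifted by the dye offset and normalization removes that shift while leaving the spread unchanged (because the offset is assumed constant here). With the dye swap, the raw mean is already close to `theta`, but the raw values alternate between roughly `theta + dye_offset` and `theta - dye_offset`, so normalization mainly shrinks the variance among replicates.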
