PCA-Variance – Why Log-Transformation of RNA-seq Data Reduces Explained Variance in PCA

bioinformatics, data-transformation, logarithm, pca, variance

I am running a PCA on a dataset with 2k rows and 36k columns. I noticed that when I log-transform the data I need to ask for more principal components during PCA to achieve the same amount of explained variance (see images attached).

What is the underlying reason for this? I have a naive intuition that log-transforming "compresses" the data, therefore the amount of explained variance for the different principal components is being "homogenized". Is there a rigorous explanation for this or did I make a mistake?

[Figure: explained variance vs. number of PCs, without log-transform ("Nolog")]
[Figure: explained variance vs. number of PCs, after log10-transform ("Log10")]

Best Answer

Based on the size of your dataset, I suspect you are working with single-cell RNA-seq data.

If so, I can confirm your observation: with scRNA-seq data, the explained variances in PCA after a log-transform are typically much lower than before. Here is a replication of your finding with the Tasic et al. 2016 dataset that I have at hand:

[Figure: explained variance per PC for raw, log-transformed, and standardized data]

Here I used $\log(x+1)$ because of exact zeros. Note that log-transformed data yield explained variances roughly similar to the standardized data (when each variable is centered and scaled to have unit variance).
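The effect is easy to reproduce without any real data. Here is a minimal sketch (simulated negative binomial counts with made-up dimensions, not your dataset or the Tasic data) comparing the explained variance of PC1 for raw, $\log(x+1)$-transformed, and standardized data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Simulate a count matrix: 500 "cells" x 2000 "genes" with wildly
# different mean expression levels and negative binomial noise.
means = rng.lognormal(mean=0.0, sigma=2.0, size=2000)
r = 2.0  # NB dispersion; p = r/(r+mu) gives mean mu, variance mu + mu^2/r
counts = rng.negative_binomial(r, r / (r + means), size=(500, 2000))

def evr(X, n=10):
    """Explained variance ratios of the first n PCs (PCA centers internally)."""
    return PCA(n_components=n).fit(X).explained_variance_ratio_

raw = evr(counts.astype(float))
logged = evr(np.log1p(counts))  # log(x + 1) because of exact zeros
standardized = evr((counts - counts.mean(0)) / (counts.std(0) + 1e-12))

print("PC1 explained variance, raw:          %.3f" % raw[0])
print("PC1 explained variance, log1p:        %.3f" % logged[0])
print("PC1 explained variance, standardized: %.3f" % standardized[0])
```

Even though the simulated genes are independent (there is no real structure to find), PC1 of the raw counts grabs a large share of the variance simply because a few genes dominate, while after `log1p` the share drops sharply, close to what standardization gives.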

The reason for this is that different variables (genes) have VERY different variances. RNA-seq data are ultimately counts of RNA molecules, and for count data the variance grows monotonically with the mean (think of the Poisson distribution). So the genes that are highly expressed will have high variance, whereas the genes that are barely expressed, or not detected at all, will have almost zero variance:

[Figure: per-gene variances in the raw and transformed data]

Without any transformation, there is one gene that alone explains over 40% of the variance (i.e. its variance is over 40% of the total variance). In this dataset, it happens to be neuropeptide Y (https://en.wikipedia.org/wiki/Neuropeptide_Y), which is very highly expressed (RPKM values over 100,000) in some cells and not expressed at all in others. When you do PCA on the raw data, PC1 will basically coincide with this single gene.
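You can see this mechanism in a toy example: plant one "NPY-like" feature with enormous variance among otherwise modest features, and the PC1 loading vector concentrates almost entirely on it (the numbers below are made up for illustration, not taken from the real data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# 200 samples, 50 features of modest (unit) variance...
X = rng.normal(0, 1, size=(200, 50))
# ...plus one toy "NPY-like" feature: very high in ~30% of samples, zero otherwise.
npy = np.where(rng.random(200) < 0.3, 1e3, 0.0)
X = np.column_stack([X, npy])

pca = PCA(n_components=2).fit(X)
loadings = pca.components_[0]  # unit-norm PC1 loading vector

print("squared PC1 loading on the dominant feature: %.3f" % loadings[-1] ** 2)
print("PC1 explained variance ratio: %.3f" % pca.explained_variance_ratio_[0])
```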

This is similar to what happens in the accepted answer to "PCA on correlation or covariance?":

> Notice that PCA on covariance is dominated by run800m and javelin: PC1 is almost equal to run800m (and explains 82% of the variance) and PC2 is almost equal to javelin (together they explain 97%). PCA on correlation is much more informative and reveals some structure in the data and relationships between variables (but note that the explained variances drop to 64% and 71%).

Update

In the comments, @An-old-man-in-the-sea brought up the issue of the variance-stabilizing transformations. The RNA-seq counts are usually modeled with a negative binomial distribution that has the following mean-variance relationship: $$V(\mu) = \mu + \frac{1}{r}\mu^2.$$
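As a quick sanity check of this mean-variance relationship, one can sample from NumPy's negative binomial generator (its $(n, p)$ parameterization has mean $n(1-p)/p$, so setting $p = r/(r+\mu)$ yields mean $\mu$ and variance $\mu + \mu^2/r$; the specific $r$ and $\mu$ values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
r = 5.0  # dispersion ("number of successes" in numpy's parameterization)

for mu in [1.0, 10.0, 100.0]:
    # p = r/(r + mu) gives mean mu and variance mu + mu^2/r
    x = rng.negative_binomial(r, r / (r + mu), size=200_000)
    print("mu=%6.1f  empirical var=%9.1f  mu + mu^2/r=%9.1f"
          % (mu, x.var(), mu + mu**2 / r))
```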

If we neglect the first term (which makes some sense under the assumption that highly expressed genes carry the most information for PCA), then the remaining mean-variance relationship is quadratic, $V(\mu)=\mu^2/r$. This coincides with the mean-variance relationship of the log-normal distribution, whose variance-stabilizing transformation is the logarithm: $$\int\frac{d\mu}{\sqrt{\mu^2}}=\log(\mu).$$

Alternatively, small powers $\mu^\alpha$ with e.g. $\alpha\approx 0.1$ or so can also make sense and would be variance-stabilizing for $V(\mu)=\mu^{2-2\alpha}$, so something in between the linear and the quadratic mean-variance relationships.

Another option is to use the exact variance-stabilizing transformation for the full negative binomial mean-variance relationship: $$\int\frac{d\mu}{\sqrt{\mu+\frac{1}{r}\mu^2}}=2\sqrt{r}\,\operatorname{arsinh}\sqrt{\frac{\mu}{r}}=2\sqrt{r}\,\log\Big(\sqrt{\frac{\mu}{r}}+\sqrt{\frac{\mu}{r}+1}\Big),$$ possibly with Anscombe's correction as $$\operatorname{arsinh}\sqrt{\frac{x+3/8}{r-3/4}}.$$ Clearly, for large $\mu$ all of these formulas reduce to $\log(\mu)$ up to additive and multiplicative constants.

See Harrison (2015), "Anscombe's 1948 variance stabilizing transformation for the negative binomial distribution is well suited to RNA-Seq expression data", and Anscombe (1948), "The Transformation of Poisson, Binomial and Negative-Binomial Data".