Solved – What can be the reason to do feature selection based on variance before doing PCA

bioinformatics, feature selection, machine learning, pca

I have noticed that when applying PCA to large datasets, people often first subset the data considerably. Sometimes they simply take a random subset of the features/variables, but usually they have a reason, largely related to removing variables they consider likely to be noise. A prototypical example is the data analysis of the Drop-Seq single-cell sequencing of retina cells, where the authors subset their expression data matrix from 25,000 genes to the 384 most highly variable genes and then proceed to use various unsupervised dimensionality reduction techniques such as PCA and t-SNE.

I have seen this sort of pre-processing in several other places as well. However, I don't understand why this sort of subsetting (feature pre-selection) is necessary. PCA reduces the dimensionality so that the retained variance is maximized; hence, the genes that are not varying should be largely ignored anyway. Why subset the data so dramatically when the non-varying genes should not have much of an effect on the result of PCA?
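
To illustrate what I mean, here is a minimal R sketch (entirely simulated data, nothing to do with the retina study): a handful of high-variance genes plus many near-constant ones go into a covariance-based PCA, and the near-constant genes end up with essentially zero loadings.

```r
## Simulated illustration: covariance-based PCA largely ignores
## genes with tiny variance (all names and numbers are made up).
set.seed(1)
n_cells <- 500
signal  <- matrix(rnorm(n_cells * 5,  sd = 3),    ncol = 5)   # 5 highly variable genes
noise   <- matrix(rnorm(n_cells * 95, sd = 0.05), ncol = 95)  # 95 nearly constant genes
X       <- cbind(signal, noise)                               # cells x genes

pca <- prcomp(X, center = TRUE, scale. = FALSE)  # covariance-based PCA

## PC1 loadings: sizeable for the variable genes, essentially zero
## for the nearly constant ones.
summary(abs(pca$rotation[1:5, 1]))
summary(abs(pca$rotation[6:100, 1]))
```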

This is not a question about this specific paper; it seems to be something of a standard approach to large datasets, so I assume that there is something I am missing.

Best Answer

The paper itself is openly available online, but its supplementary materials are not, so I copy the relevant parts here. Here is how the authors ran PCA:

We ran Principal Components Analysis (PCA) on our training set as previously described (Shalek et al., 2013), using the prcomp function in R, after scaling and centering the data along each gene. We used only the previously identified “highly variable” genes as input to the PCA in order to ensure robust identification of the primary structures in the data.
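
In R terms, this step presumably amounts to something like the sketch below; the matrix `expr` and the gene list `hv_genes` are simulated placeholders, not the paper's actual objects (the quote only states that `prcomp` was used with per-gene centering and scaling).

```r
## Sketch of the described PCA call; `expr` and `hv_genes` are
## placeholders standing in for the real expression matrix (cells x genes)
## and the pre-selected highly variable genes.
set.seed(1)
expr <- matrix(rpois(200 * 50, lambda = 5), nrow = 200,
               dimnames = list(NULL, paste0("gene", 1:50)))
hv_genes <- paste0("gene", 1:10)   # pretend these were pre-selected

## scale. = TRUE z-scores each gene, i.e. PCA on the correlation matrix.
pca <- prcomp(expr[, hv_genes], center = TRUE, scale. = TRUE)
head(pca$x[, 1:2])                 # cell coordinates on PC1/PC2
```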

And here is how they selected the "highly variable" genes:

We first identified the set of genes that was most variable across our training set, after controlling for the relationship between mean expression and variability. We calculated the mean and a dispersion measure (variance/mean) for each gene across all 13,155 single cells, and placed genes into 20 bins based on their average expression. Within each bin, we then z-normalized the dispersion measure of all genes within the bin, in order to identify outlier genes whose expression values were highly variable even when compared to genes with similar average expression. We used a z-score cutoff of 1.7 to identify 384 highly variable genes.
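
A rough R translation of that selection step might look as follows; the dispersion measure (variance/mean), the 20 bins, and the 1.7 cutoff come from the quote, while the expression matrix itself is simulated for illustration (the paper does not specify how exactly the bins were formed, so equal-width bins are used here as a stand-in).

```r
## Rough sketch of the binned-dispersion selection described above;
## `expr` is a simulated cells x genes count matrix.
set.seed(1)
expr <- matrix(rpois(1000 * 200, lambda = rep(runif(200, 1, 20), each = 1000)),
               nrow = 1000, dimnames = list(NULL, paste0("gene", 1:200)))

gene_mean <- colMeans(expr)
gene_disp <- apply(expr, 2, var) / gene_mean      # dispersion = variance / mean

## Bin genes by mean expression, then z-normalize dispersion within each bin.
bins   <- cut(gene_mean, breaks = 20)
disp_z <- ave(gene_disp, bins, FUN = function(d) (d - mean(d)) / sd(d))

## Genes whose dispersion is unusually high for their expression level.
hv_genes <- names(gene_disp)[!is.na(disp_z) & disp_z > 1.7]
length(hv_genes)
```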


Notice two things:

  1. They select "highly variable" genes based on their variances (relative to the mean, but this is not important here). The genes with unusually large variances will get selected.

  2. They perform PCA after scaling, i.e. $z$-scoring, the data for each gene. In other words, PCA is done on the correlation matrix, not on the covariance matrix. The scaled genes that go into PCA all have the same variance, equal to $1$ (a short demonstration follows below).
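
Here is a quick way to convince yourself of point 2 with a simulated matrix (purely illustrative): with `scale. = TRUE`, `prcomp` reproduces the eigendecomposition of the correlation matrix, and every gene enters with unit variance.

```r
## With scale. = TRUE, prcomp reproduces the eigendecomposition of the
## correlation matrix; every scaled gene has variance 1. Simulated data.
set.seed(1)
X <- matrix(rnorm(100 * 6), nrow = 100)   # 100 cells, 6 "genes"

pca <- prcomp(X, center = TRUE, scale. = TRUE)
eig <- eigen(cor(X))

round(cbind(prcomp_var = pca$sdev^2, cor_eigenvalues = eig$values), 10)
apply(scale(X), 2, var)   # all equal to 1 after scaling
```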

This explains why the pre-selection is not superfluous here: the PCA is done on correlations, i.e. without using the variance information at all; the variances are only used for pre-selection.

One can certainly imagine a situation where PCA on correlations between all genes and PCA on correlations between only the "highly variable" genes yield very different results. E.g., in principle it can happen that the least variable genes are more strongly correlated with each other (and would therefore dominate the PCA) than the highly variable genes are.
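
For example, in the following toy simulation (entirely made up, only to show that the scenario is possible), a block of low-variance but strongly correlated genes dominates PC1 of the correlation-based PCA, whereas a variance-based pre-selection would have discarded that block entirely.

```r
## Toy counterexample: 20 low-variance genes driven by one shared factor,
## plus 5 high-variance independent genes. All values are simulated.
set.seed(1)
n        <- 1000
latent   <- rnorm(n)
low_var  <- sapply(1:20, function(i) 0.1 * latent + rnorm(n, sd = 0.03))
high_var <- matrix(rnorm(n * 5, sd = 5), ncol = 5)
X        <- cbind(low_var, high_var)          # cells x genes

## Correlation-based PCA on all genes: PC1 is dominated by the
## correlated low-variance block, not by the highly variable genes.
pca_all <- prcomp(X, center = TRUE, scale. = TRUE)
round(abs(pca_all$rotation[, 1]), 2)

## The per-gene variances show which genes a variance-based filter would keep.
round(apply(X, 2, var), 3)
```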

I have no experience with such data, so I cannot comment on how useful this procedure is in this particular application domain.