I have noticed that when applying PCA to large datasets, people often first subset the data considerably. Sometimes they just take a random subset of the features/variables, but often they have a reason, largely related to removing variables they consider likely to be noise. A prototypical example is the data analysis of the Drop-Seq single-cell sequencing of retina cells: the authors subset their expression data matrix from 25,000 genes to the 384 most highly variable genes and then proceed to use various unsupervised dimensionality reduction techniques like PCA and t-SNE.
I have seen this sort of pre-processing in several other places as well. However, I don't understand why this sort of subsetting (feature pre-selection) is necessary. PCA reduces the dimensionality by finding the directions of maximal variance; hence, the genes that are not varying will be largely ignored. Why subset the data so dramatically when the non-varying genes should not have much of an effect on the result of PCA?
This is not a specific question about this paper; it seems to be something of a standard approach to large datasets, so I assume that there is something I am missing.
Best Answer
The paper itself is openly available online, but its supplementary materials are not, so I copy here the relevant parts. Here is how the authors ran PCA:
And here is how they selected the "highly variable" genes:
Notice two things:
They select "highly variable" genes based on their variances (relative to the mean, but this is not important here). The genes with unusually large variances will get selected.
They perform PCA after scaling, i.e. $z$-scoring, the data for each gene. In other words, PCA is done on the correlation matrix, not on the covariance matrix. The scaled genes that go into PCA all have the same variance, equal to $1$.
This explains why the pre-selection is not superfluous here: the PCA is done on correlations, i.e. without using the variance information at all; the variances are only used for pre-selection.
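To make the point about scaling concrete, here is a minimal NumPy sketch. The matrix `X`, its size, and the per-gene `scales` are made up for illustration and have nothing to do with the paper's data. It checks that after $z$-scoring every gene has variance $1$, and that the covariance matrix of the scaled data coincides with the correlation matrix of the original data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expression matrix (cells x genes); the genes are given
# very different scales on purpose.
n_cells, n_genes = 500, 6
scales = np.array([0.1, 0.5, 1.0, 5.0, 10.0, 50.0])
X = rng.normal(size=(n_cells, n_genes)) * scales

# z-score each gene (column): subtract its mean, divide by its std.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# After scaling, every gene has unit variance ...
variances_after_scaling = Z.var(axis=0, ddof=1)

# ... and the covariance of Z is the correlation matrix of X.
cov_of_Z = np.cov(Z, rowvar=False)
corr_of_X = np.corrcoef(X, rowvar=False)
same_matrix = np.allclose(cov_of_Z, corr_of_X)

print(variances_after_scaling)
print(same_matrix)
```

So once the data are scaled, PCA has no way of telling a gene that originally had a large variance from one that had a tiny variance.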
One can certainly imagine a situation where PCA on correlations between all genes and PCA on correlations between "highly variable" genes will yield very different results. E.g. in principle it can happen that the least variable genes are more strongly correlated with each other than the highly variable genes are, and will therefore dominate the PCA.
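A toy simulation of such a situation, as a sketch only: the two latent factors `f1` and `f2`, the gene counts, and all amplitudes are assumptions chosen to make the effect obvious, not anything taken from real data. Many low-variance genes are strongly correlated through `f1`, while a few high-variance genes are only moderately correlated through `f2`.

```python
import numpy as np

def first_pc_scores(X):
    """PC1 scores of correlation-based PCA (z-score columns, then SVD)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, 0] * S[0]

def abs_corr(a, b):
    """Absolute Pearson correlation between two vectors."""
    return abs(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
n_cells, n_low, n_high = 1000, 200, 20

# Two independent hypothetical latent factors.
f1 = rng.normal(size=n_cells)
f2 = rng.normal(size=n_cells)

# Low-variance genes: tiny amplitude, but tightly coupled to f1.
low = 0.1 * (f1[:, None] + 0.3 * rng.normal(size=(n_cells, n_low)))
# High-variance genes: large amplitude, only moderately coupled to f2.
high = 10.0 * (0.7 * f2[:, None] + rng.normal(size=(n_cells, n_high)))
X = np.hstack([low, high])

# Pre-select the n_high genes with the largest variance.
top = np.argsort(X.var(axis=0))[-n_high:]

pc1_all = first_pc_scores(X)          # correlation PCA on all genes
pc1_sel = first_pc_scores(X[:, top])  # correlation PCA after pre-selection

print("all genes: |corr(PC1, f1)| =", abs_corr(pc1_all, f1))
print("all genes: |corr(PC1, f2)| =", abs_corr(pc1_all, f2))
print("selected:  |corr(PC1, f2)| =", abs_corr(pc1_sel, f2))
```

With these settings, PC1 of the all-genes analysis should track `f1` (the factor behind the low-variance genes), whereas after pre-selection PC1 should track `f2`. Whether real expression data behave anything like this is a separate, empirical question.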
I have no experience with such data, so I cannot comment on how useful this procedure is in this particular application domain.