Solved – What can be the reason to do feature selection based on variance before doing PCA

bioinformatics, feature selection, machine learning, pca

I have noticed that when applying PCA to large datasets, people often first subset the data considerably. Sometimes they simply take a random subset of the features/variables, but usually they have a reason, largely related to removing variables they consider likely to be noise. A prototypical example is the data analysis of the Drop-Seq single-cell sequencing of retina cells, where the authors subset their expression data matrix from 25,000 genes to the 384 most highly variable genes and then proceed to use various unsupervised dimensionality reduction techniques such as PCA and t-SNE.

I have seen this sort of pre-processing in several other places as well. However, I don't understand why this sort of subsetting (feature pre-selection) is necessary. PCA reduces the dimensionality so that the retained variance is maximized; hence, the genes that are not varying should be largely ignored anyway. Why subset the data so dramatically when the non-varying genes should not have much of an effect on the result of PCA?
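
To illustrate what I mean, here is a minimal R sketch (entirely simulated data, nothing to do with the retina study): a handful of high-variance genes plus many near-constant ones go into a covariance-based PCA, and the near-constant genes end up with essentially zero loadings.

```r
## Simulated illustration: covariance-based PCA largely ignores
## genes with tiny variance (all names and numbers are made up).
set.seed(1)
n_cells <- 500
signal  <- matrix(rnorm(n_cells * 5,  sd = 3),    ncol = 5)   # 5 highly variable genes
noise   <- matrix(rnorm(n_cells * 95, sd = 0.05), ncol = 95)  # 95 nearly constant genes
X       <- cbind(signal, noise)                               # cells x genes

pca <- prcomp(X, center = TRUE, scale. = FALSE)  # covariance-based PCA

## PC1 loadings: sizeable for the variable genes, essentially zero
## for the nearly constant ones.
summary(abs(pca$rotation[1:5, 1]))
summary(abs(pca$rotation[6:100, 1]))
```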

This is not a question about this specific paper; it seems to be something of a standard approach to large datasets, so I assume that there is something I am missing.

Best Answer

The paper itself is openly available online, but its supplementary materials are not, so I copy the relevant parts here. Here is how the authors ran PCA:

We ran Principal Components Analysis (PCA) on our training set as previously described (Shalek et al., 2013), using the prcomp function in R, after scaling and centering the data along each gene. We used only the previously identified “highly variable” genes as input to the PCA in order to ensure robust identification of the primary structures in the data.
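
In R terms, this step presumably amounts to something like the sketch below; the matrix `expr` and the gene list `hv_genes` are simulated placeholders, not the paper's actual objects (the quote only states that `prcomp` was used with per-gene centering and scaling).

```r
## Sketch of the described PCA call; `expr` and `hv_genes` are
## placeholders standing in for the real expression matrix (cells x genes)
## and the pre-selected highly variable genes.
set.seed(1)
expr <- matrix(rpois(200 * 50, lambda = 5), nrow = 200,
               dimnames = list(NULL, paste0("gene", 1:50)))
hv_genes <- paste0("gene", 1:10)   # pretend these were pre-selected

## scale. = TRUE z-scores each gene, i.e. PCA on the correlation matrix.
pca <- prcomp(expr[, hv_genes], center = TRUE, scale. = TRUE)
head(pca$x[, 1:2])                 # cell coordinates on PC1/PC2
```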

And here is how they selected the "highly variable" genes:

We first identified the set of genes that was most variable across our training set, after controlling for the relationship between mean expression and variability. We calculated the mean and a dispersion measure (variance/mean) for each gene across all 13,155 single cells, and placed genes into 20 bins based on their average expression. Within each bin, we then z-normalized the dispersion measure of all genes within the bin, in order to identify outlier genes whose expression values were highly variable even when compared to genes with similar average expression. We used a z-score cutoff of 1.7 to identify 384 highly variable genes.
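
A rough R translation of that selection step might look as follows; the dispersion measure (variance/mean), the 20 bins, and the 1.7 cutoff come from the quote, while the expression matrix itself is simulated for illustration (the paper does not specify how exactly the bins were formed, so equal-width bins are used here as a stand-in).

```r
## Rough sketch of the binned-dispersion selection described above;
## `expr` is a simulated cells x genes count matrix.
set.seed(1)
expr <- matrix(rpois(1000 * 200, lambda = rep(runif(200, 1, 20), each = 1000)),
               nrow = 1000, dimnames = list(NULL, paste0("gene", 1:200)))

gene_mean <- colMeans(expr)
gene_disp <- apply(expr, 2, var) / gene_mean      # dispersion = variance / mean

## Bin genes by mean expression, then z-normalize dispersion within each bin.
bins   <- cut(gene_mean, breaks = 20)
disp_z <- ave(gene_disp, bins, FUN = function(d) (d - mean(d)) / sd(d))

## Genes whose dispersion is unusually high for their expression level.
hv_genes <- names(gene_disp)[!is.na(disp_z) & disp_z > 1.7]
length(hv_genes)
```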


Notice two things:

  1. They select "highly variable" genes based on their variances (relative to the mean, but this is not important here). The genes with unusually large variances will get selected.

  2. They perform PCA after scaling, i.e. $z$-scoring, the data for each gene. In other words, PCA is done on the correlation matrix, not on the covariance matrix. The scaled genes that go into PCA all have the same variance, equal to $1$ (a short demonstration follows below).
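
Here is a quick way to convince yourself of point 2 with a simulated matrix (purely illustrative): with `scale. = TRUE`, `prcomp` reproduces the eigendecomposition of the correlation matrix, and every gene enters with unit variance.

```r
## With scale. = TRUE, prcomp reproduces the eigendecomposition of the
## correlation matrix; every scaled gene has variance 1. Simulated data.
set.seed(1)
X <- matrix(rnorm(100 * 6), nrow = 100)   # 100 cells, 6 "genes"

pca <- prcomp(X, center = TRUE, scale. = TRUE)
eig <- eigen(cor(X))

round(cbind(prcomp_var = pca$sdev^2, cor_eigenvalues = eig$values), 10)
apply(scale(X), 2, var)   # all equal to 1 after scaling
```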

This explains why the pre-selection is not superfluous here: the PCA is done on correlations, i.e. without using the variance information at all; the variances are only used for pre-selection.

One can certainly imagine a situation where PCA on correlations between all genes and PCA on correlations between only the "highly variable" genes yield very different results. E.g., in principle it can happen that the least variable genes are more strongly correlated with each other (and would therefore dominate the PCA) than the highly variable genes are.
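
For example, in the following toy simulation (entirely made up, only to show that the scenario is possible), a block of low-variance but strongly correlated genes dominates PC1 of the correlation-based PCA, whereas a variance-based pre-selection would have discarded that block entirely.

```r
## Toy counterexample: 20 low-variance genes driven by one shared factor,
## plus 5 high-variance independent genes. All values are simulated.
set.seed(1)
n        <- 1000
latent   <- rnorm(n)
low_var  <- sapply(1:20, function(i) 0.1 * latent + rnorm(n, sd = 0.03))
high_var <- matrix(rnorm(n * 5, sd = 5), ncol = 5)
X        <- cbind(low_var, high_var)          # cells x genes

## Correlation-based PCA on all genes: PC1 is dominated by the
## correlated low-variance block, not by the highly variable genes.
pca_all <- prcomp(X, center = TRUE, scale. = TRUE)
round(abs(pca_all$rotation[, 1]), 2)

## The per-gene variances show which genes a variance-based filter would keep.
round(apply(X, 2, var), 3)
```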

I have no experience with such data, so I cannot comment on how useful this procedure is in this particular application domain.