Solved – How to find which variables are most correlated with the first principal component

bioinformaticspcar

I came across an article where the authors did a Principal Component Analysis on gene expression data, and found out the genes that are most correlated to the 1st principal component, and they used that gene list for further analysis.
Can somebody tell me how to find out entities (genes in this case) that are most correlated to the 1st principal component?

Here's the link to the original free article, and this is how they've calculated it:

Results of the gene set enrichment analysis of the genes most correlated to the 1st principal component. The correlations of the genes with the 1st principal component were transformed to SDs from the mean, then genes with values > 1.5 (positive correlation) or < -1.5 (negative correlation) were selected.

I can make a toy example of 6 samples and 20 gene matrix and do the PCA the following way but how to proceed next:

rm(list=ls())
set.seed(12345)
my.mat <- matrix(rnorm(120,0,0.5),nrow=6,byrow=TRUE)
rownames(my.mat) <- paste("s",1:6,sep="")
colnames(my.mat) <- paste("g",1:20,sep="")
head(my.mat)

#Ensure that input data is Z-transformed 
pca.object <- prcomp(my.mat,center=TRUE,scale.=TRUE)
summary(pca.object)
par(mfrow=c(1,2))
plot(pca.object)
biplot(pca.object)

#The Rotation
pca.object$rotation

Best Answer

Summary: if the original variables were standardized, then you should simply look at the first principal axis (rotation in the R terminology) and select variables with highest absolute values.


Consider dataset $\mathbf{X}$ with centered variables in columns and $N$ data points in rows. Performing PCA of this dataset amounts to singular value decomposition $\mathbf{X} = \mathbf{U} \mathbf{S} \mathbf{V}^\top$. Here columns of $\mathbf{V}$ are principal axes, $\mathbf{S}$ is a diagonal matrix with singular values, and columns of $\mathbf{U}$ are principal components scaled to unit norm. Standardized PCs are given by $\sqrt{N-1}\mathbf{U}$. PCs themselves (also known as "scores") are given by columns of $\mathbf{US}$.

Note that covariance matrix is given by $\frac{1}{N-1}\mathbf{X}^\top \mathbf{X} = \mathbf{V}\frac{\mathbf{S}^2}{{N-1}}\mathbf{V}^\top=\mathbf{VEV}^\top$, so principal axes $\mathbf{V}$ are eigenvectors of the covariance matrix and $\mathbf E=\frac{\mathbf S^2}{N-1}$ are its eigenvalues.

We can now compute cross-covariance matrix between original variables and standardized PCs: $$\frac{1}{N-1}\mathbf{X}^\top(\sqrt{N-1}\mathbf{U}) = \frac{1}{\sqrt{N-1}}\mathbf{V}\mathbf{S}\mathbf{U}^\top\mathbf{U} = \mathbf{V}\frac{\mathbf{S}}{\sqrt{N-1}}=\mathbf{V}\sqrt{\mathbf E}=\mathbf{L}.$$ This matrix is called loadings matrix: it is given by the eigenvectors of the covariance matrix scaled by the square roots of the respective eigenvalues.

Cross-correlation matrix between original variables and PCs is given by the same expression divided by the standard deviations of the original variables (by definition of correlation). If the original variables were standardized prior to performing PCA (i.e. PCA was performed on the correlation matrix) they are all equal to $1$. In this last case the cross-correlation matrix is again given simply by $\mathbf{L}$.

You are only interested in the top correlations with the first PC, which means that you should look at the first column of $\mathbf{L}$ and select variables with highest absolute values. But notice that the first column of $\mathbf{L}$ is equal to the first column of $\mathbf{V}$, up to a scaling factor (given by the square root of the first eigenvalue of the covariance matrix). So equivalently, you can look at the first column of $\mathbf{V}$ (i.e. the first principal axis, or rotation in the R terminology) and select variables with highest absolute values.

But again, this is only true if the original variables were standardized.

Related Question