PCA Variance – Is There Any Required Amount of Variance Captured by PCA for Later Analyses?

I have a dataset with 11 variables, and PCA (orthogonal) was done to reduce the data. When deciding on the number of components to keep, it was evident to me from my knowledge of the subject and from the scree plot (see below) that two principal components (PCs) were enough to explain the data and that the remaining components were less informative.

Scree plot with parallel analysis: observed eigenvalues (green) and simulated eigenvalues based on 100 simulations (red). Scree plot suggests 3 PCs, whereas parallel test suggests only the first two PCs.

As you can see only 48% of the variance could be captured by the first two PCs.

Plotting the observations on the plane spanned by the first two PCs revealed three different clusters, identified using hierarchical agglomerative clustering (HAC) and K-means clustering. These 3 clusters turned out to be very relevant to the problem in question and were consistent with other findings as well. So, apart from the fact that only 48% of the variance was captured, everything else was tremendously fine.

One of my two reviewers said: one cannot rely much on these findings as only 48% of variance could be explained and it is less than required.

Question
Is there any required value of how much variance should be captured by PCA to be valid? Is it not dependent on the domain knowledge and methodology in use? Can anybody judge on the merit of the whole analysis just based on the mere value of the explained variance?

Notes

Data are 11 variables of genes measured by a very sensitive methodology in molecular biology called Real-Time Quantitative Polymerase Chain Reaction (RT-qPCR).
Analyses were done using R.
Answers from data analysts based on their personal experience working on real-life problems in fields such as microarray analysis, chemometrics, spectrometric analyses or the like are much appreciated.
Please consider supporting your answer with references as much as possible.

Best Answer

Regarding your particular questions:

Is there any required value of how much variance should be captured by PCA to be valid?

No, there is not (to the best of my knowledge). I firmly believe that there is no single value you can use; no magic threshold of the captured variance percentage. Cangelosi and Goriely's article, "Component retention in principal component analysis with application to cDNA microarray data", gives a rather nice overview of half a dozen standard rules of thumb for choosing the number of components in a study (scree plot, proportion of total variance explained, average eigenvalue rule, log-eigenvalue diagram, etc.). As they are only rules of thumb, I would not rely strongly on any of them.
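To illustrate why no single rule settles the matter, here is a minimal sketch of two of those rules of thumb (proportion of total variance explained and the average eigenvalue rule). The eigenvalues are hypothetical numbers made up for illustration, not the ones from the question, and the function names are my own.

```python
# Hypothetical eigenvalues of a correlation matrix for 11 variables
# (illustrative only; they sum to 11, as for any 11-variable correlation matrix).
eigenvalues = [3.2, 2.1, 1.2, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.35, 0.25]


def n_by_cumulative_variance(eigs, threshold):
    """Smallest k such that the k largest eigenvalues reach `threshold` of the total."""
    total = sum(eigs)
    cumulative = 0.0
    for k, eig in enumerate(sorted(eigs, reverse=True), start=1):
        cumulative += eig
        if cumulative / total >= threshold:
            return k
    return len(eigs)


def n_by_average_eigenvalue(eigs):
    """Number of eigenvalues strictly larger than the mean eigenvalue."""
    mean_eig = sum(eigs) / len(eigs)
    return sum(1 for eig in eigs if eig > mean_eig)


share_first_two = sum(eigenvalues[:2]) / sum(eigenvalues)
print(f"first two PCs explain {share_first_two:.0%} of the variance")
print("80% cumulative variance rule keeps:", n_by_cumulative_variance(eigenvalues, 0.80))
print("average eigenvalue rule keeps:", n_by_average_eigenvalue(eigenvalues))
```

With these made-up eigenvalues the two rules disagree substantially, which is the point: the chosen threshold, not the data, drives the answer.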

Is it not dependent on the domain knowledge and methodology in use?

Ideally it should be dependent but you need to be careful how you word it and what you mean.

For example: In Acoustics there is the notion of Just Noticeable Difference (JND). Assume you are analyzing an acoustics sample and a particular PC has physical-scale variation well below that JND threshold. Nobody can readily argue that for an Acoustics application you should have included that PC. You would be analyzing inaudible noise. There might be some reasons to include this PC, but these reasons need to be presented, not the other way around. Are there notions similar to JND for RT-qPCR analysis?

Similarly, if a component looks like a 9th-order Legendre polynomial and you have strong evidence that your sample consists of single Gaussian bumps, you have good reasons to believe you are again modeling irrelevant variation. What are these orthogonal modes of variation showing? What is "wrong" with the 3rd PC in your case, for example?

The fact that you say "These 3 clusters turned out to be very relevant to the problem in question" is not really a strong argument. You might simply be data dredging (which is a bad thing). There are other techniques, e.g. Isomap and locally linear embedding, which are pretty cool too; why not use those? Why did you choose PCA specifically?

The consistency of your findings with other findings is more important, especially if these findings are considered well-established. Dig deeper on this. Try to see if your results agree with PCA findings from other studies.

Can anybody judge on the merit of the whole analysis just based on the mere value of the explained variance?

In general one should not do that. Do not think that your reviewer is a bastard or anything like that though; 48% is indeed a small percentage to retain without presenting reasonable justifications.

Reporting standard deviations instead of variances

I think you are right in that standard deviation of each PC can perhaps be a more reasonable or a more intuitive (for some) measure of its "influence" than its variance. And actually it even has a clear mathematical interpretation: variances of PCs are eigenvalues of the covariance matrix, but standard deviations are singular values of the centered data matrix [only scaled by $1/\sqrt{n-1}$].
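The relation can be written out in a few lines. This is only a sketch with notation I am introducing here: $X$ is the centered $n \times p$ data matrix, $X = U S V^\top$ is its singular value decomposition with singular values $s_i$, and $\lambda_i$ are the eigenvalues of the covariance matrix $C$.

$$
C = \frac{1}{n-1} X^\top X
  = \frac{1}{n-1} V S U^\top U S V^\top
  = V \left( \frac{S^2}{n-1} \right) V^\top
\quad\Longrightarrow\quad
\lambda_i = \frac{s_i^2}{n-1},
\qquad
\sqrt{\lambda_i} = \frac{s_i}{\sqrt{n-1}} .
$$

So the variance of the $i$-th PC is $\lambda_i$ and its standard deviation is the $i$-th singular value up to a constant. Implementations that divide by $n$ instead of $n-1$ change only that constant.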

So yes, it is completely fine to report it. Moreover, e.g. R does report standard deviations of PCs rather than their variances. For example running this simple code:

# PCA on the four numeric iris columns (column 5 is the Species factor)
irispca <- princomp(iris[-5])
summary(irispca)

results in this:

Importance of components:
                          Comp.1     Comp.2     Comp.3      Comp.4
Standard deviation     2.0494032 0.49097143 0.27872586 0.153870700
Proportion of Variance 0.9246187 0.05306648 0.01710261 0.005212184
Cumulative Proportion  0.9246187 0.97768521 0.99478782 1.000000000

There are standard deviations here, but not variances.

Explained variance

A PC that contains 95% of the data variance might contain only 80% of the variation in the data as measured in standard deviations: isn't the latter a better descriptor?

However, note that after presenting standard deviations, R does not display a "proportion of standard deviation", but instead a proportion of variance. And there is a very good reason for that.

Mathematically, total variance (being the trace of the covariance matrix) is preserved under rotations. This means that the sum of the variances of the original variables is equal to the sum of the variances of the PCs. In the case of the same Fisher Iris dataset, this sum is equal to $4.57$, and so we can say that PC1, having a variance of $2.05^2=4.20$, explains $92\%$ of the total variance.

But the sum of standard deviations is not preserved! The sum of standard deviations of original variables is $3.79$. The sum of standard deviations of PCs is $2.98$. They are not equal! So if you want to say that PC1 with standard deviation $2.05$ explains $x\%$ of the "total standard deviation", what would you take as this total? There is no answer, because it simply does not make sense.

The bottom line is that it is completely fine to look at the standard deviation of each PC and even compare them between each other, but if you want to talk about "explained" something, then only "explained variance" makes sense.
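The preservation argument is easy to check numerically. Below is a minimal sketch on a hypothetical two-variable dataset (made-up numbers, not the Iris data), rotated by an arbitrary angle rather than by the PCA rotation itself; since PCA is one particular rotation of the centered data, the same behaviour applies to it.

```python
import math
import statistics

# Hypothetical two-variable dataset (illustrative numbers only).
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]


def rotate(a, b, angle):
    """Rotate the points (a_i, b_i) by `angle` radians around the origin."""
    c, s = math.cos(angle), math.sin(angle)
    u = [c * p + s * q for p, q in zip(a, b)]
    v = [-s * p + c * q for p, q in zip(a, b)]
    return u, v


def total_variance(a, b):
    """Sum of the sample variances of the two coordinates."""
    return statistics.variance(a) + statistics.variance(b)


def total_sd(a, b):
    """Sum of the sample standard deviations of the two coordinates."""
    return statistics.stdev(a) + statistics.stdev(b)


u, v = rotate(x, y, math.radians(30))  # arbitrary rotation angle

print("total variance before/after:", total_variance(x, y), total_variance(u, v))
print("total sd before/after:      ", total_sd(x, y), total_sd(u, v))
```

The two total variances agree up to floating-point error, while the two sums of standard deviations differ, so there is no fixed "total standard deviation" to take a proportion of.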

Solved – Using PCA in Matlab: Is it based on the covariance or correlation matrix

In regards to the question in the title: The function pca in MATLAB uses the SVD of the centred dataset to perform PCA; this excellent thread elucidates the relation between the two. Using the SVD corresponds to using the covariance matrix, not the correlation matrix.

Having said that, and to answer the main question of the post: if one z-scores the data and then uses the covariance matrix for PCA, the results will be equivalent to using the correlation matrix of the original data. This can be easily seen by computing the difference cov(zscore(A)) - corr(A), which should be zero to numerical precision (where A is the dataset matrix).

So yes, there will be a difference if you use the correlation-based instead of the covariance-based PCA methodology; if you $z$-score your dataset, though, the two methodologies will give equal results. In general, I would recommend you $z$-score your variables before doing PCA, especially if they are measured on different scales. Otherwise the differences in their magnitudes can potentially dominate the subsequent eigenanalysis (and the interpretation of the final results). This topic is explored in more detail in this thread on doing PCA on correlation or covariance?
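The same check as cov(zscore(A)) - corr(A) can be sketched without MATLAB. The following uses plain Python on two hypothetical variables measured on very different scales; the helper functions are my own and use the $n-1$ divisor throughout, which I am assuming matches the defaults of the MATLAB functions named above.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance with the n-1 divisor."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def zscore(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    m = mean(values)
    sd = math.sqrt(covariance(values, values))
    return [(p - m) / sd for p in values]


def correlation(a, b):
    """Pearson correlation of the original (unscaled) variables."""
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))


# Hypothetical variables on very different scales (illustrative only).
a = [2.0, 4.0, 6.0, 8.0, 10.0]
b = [100.0, 300.0, 200.0, 500.0, 400.0]

cov_of_z = covariance(zscore(a), zscore(b))
corr = correlation(a, b)
print("covariance of z-scores:", cov_of_z)
print("correlation of raw data:", corr)
print("difference:", cov_of_z - corr)
```

The difference is zero up to floating-point error, and each z-scored variable has unit variance, so the covariance matrix of the z-scored data is the correlation matrix of the raw data.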
