PCA – How to Select the Right PCA Type for Data Analysis

eigenvalues, pca, pearson-r, spearman-rho, xlstat

I'm completing scientific analysis of chemical compounds in consumer products. As a non-statistician, I would really appreciate any thoughts from the experts here.

My data is non-normal, so I've used non-parametric tests such as Mann-Whitney (MW) and Kruskal-Wallis (KW) for hypothesis testing between samples so far. However, I now have to conduct a principal component analysis (PCA) of the different compounds measured in the different products (measured in different units).

The stats add-in I was using asks for the data format to be specified (e.g. an observation/variable table, versus a correlation or covariance matrix). I'm working with straight data, so I used the observation/variable table set-up.

But it also asks me to specify the PCA type from the following options: Pearson (n), Pearson (n-1), Spearman, Kendall, Covariance… I tested the same data set with the Pearson (n) option and the Spearman option and got very different eigenvalues and eigenvectors. The final biplot is naturally quite different.

Any help someone can provide regarding what the difference is, and what PCA type should be used would be greatly appreciated.

UPDATE: I was using XLSTAT (an Excel add-in). Is it okay to use Pearson as the "PCA type" when the correlations between the variables are non-linear? This "PCA type" option does not appear in other stats programs (e.g. SPSS), so a novice user of SPSS would by default be using the Pearson "PCA type".

Best Answer

The principal vectors are the eigenvectors of the matrix you choose. When you choose Pearson, you are choosing to find the eigenvectors of the Pearson correlation matrix. When you choose Spearman, you are choosing to find the eigenvectors of the Spearman rank correlation matrix. The Spearman rank correlation is just the Pearson correlation between the ranked variables.
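To make this concrete, here is a minimal sketch in Python (NumPy and SciPy, not XLSTAT). The data are randomly generated placeholders, and the helper name `pca_from_matrix` is my own, so treat it as an illustration of the idea rather than what XLSTAT does internally. It builds the Spearman matrix by ranking each column and taking the ordinary Pearson correlation, checks that against SciPy's `spearmanr`, and then eigendecomposes both matrices.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Hypothetical skewed data: 50 observations (rows) of 3 variables (columns).
rng = np.random.default_rng(0)
X = rng.lognormal(size=(50, 3))

# Pearson correlation matrix of the raw variables.
pearson_matrix = np.corrcoef(X, rowvar=False)

# Spearman correlation matrix: rank each column, then take Pearson of the ranks.
ranks = rankdata(X, axis=0)
spearman_matrix = np.corrcoef(ranks, rowvar=False)

# Cross-check against SciPy's own Spearman implementation.
rho_scipy, _ = spearmanr(X)

def pca_from_matrix(m):
    """Eigenvalues (descending) and matching eigenvectors of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(m)      # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

vals_pearson, vecs_pearson = pca_from_matrix(pearson_matrix)
vals_spearman, vecs_spearman = pca_from_matrix(spearman_matrix)

print("Pearson eigenvalues: ", vals_pearson)
print("Spearman eigenvalues:", vals_spearman)
```

In both cases the eigenvalues sum to the number of variables, because a correlation matrix has ones on its diagonal. Only the way the total is split across components changes.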

Due to the major difference in the nature of these matrices, it makes sense that the produced biplots are very different. If you really believe that the relationship between your variables is linear, then I would stick with Pearson; otherwise I would go with Spearman or Kendall, which measure monotonic rather than strictly linear association.
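As a rough illustration of why the choice matters, the sketch below uses two made-up variables with a perfectly monotonic but strongly non-linear relationship (`y = exp(x)`). The data and the helper `leading_eigenvalue` are assumptions for the example only. For two variables the correlation matrix is `[[1, r], [r, 1]]`, whose eigenvalues are `1 + r` and `1 - r`, so the first component's share of variance follows directly from the correlation coefficient.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical pair of variables: y is a monotonic, non-linear function of x.
rng = np.random.default_rng(42)
x = rng.normal(scale=2.0, size=200)
y = np.exp(x)

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)

def leading_eigenvalue(r):
    """Largest eigenvalue of the 2x2 correlation matrix [[1, r], [r, 1]]."""
    corr = np.array([[1.0, r], [r, 1.0]])
    return np.linalg.eigvalsh(corr)[-1]  # eigvalsh sorts ascending

lam_pearson = leading_eigenvalue(r_pearson)
lam_spearman = leading_eigenvalue(r_spearman)

print(f"Pearson r  = {r_pearson:.3f}, first eigenvalue = {lam_pearson:.3f}")
print(f"Spearman r = {r_spearman:.3f}, first eigenvalue = {lam_spearman:.3f}")
```

Because the ranks of `x` and `y` agree exactly, the Spearman correlation is 1 and the first component captures all of the variance, whereas the Pearson-based PCA attributes noticeably less to the first component for the same data.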