Solved – Can PCA identify redundant variables that can be removed before cluster analysis?

clustering, feature selection, multivariate analysis, pca

I hope this is suitable for this forum: I am new to PCA and what I ultimately want to do is perform cluster analysis on my dataset.

I have 20 physical descriptor variables for organisms, each with 300 data points. I produced a correlation matrix to see which variables are correlated with each other, and found that a number of them are.
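For readers who want to reproduce this first step, here is a minimal sketch of flagging strongly correlated pairs from a correlation matrix. The data are synthetic stand-ins for a 300 x 20 table, and the function name `correlated_pairs` and the 0.8 threshold are illustrative assumptions, not values from my analysis.

```python
import numpy as np

def correlated_pairs(X, threshold=0.8):
    """Return (i, j, r) for variable pairs whose |Pearson r| meets the threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are variables, rows are observations
    n_vars = corr.shape[0]
    pairs = []
    for i in range(n_vars):
        for j in range(i + 1, n_vars):  # upper triangle only, skip the diagonal
            if abs(corr[i, j]) >= threshold:
                pairs.append((i, j, float(corr[i, j])))
    return pairs

# Synthetic stand-in for the data: column 1 is nearly a copy of column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=300)

pairs = correlated_pairs(X)
```

A pairwise check like this only catches redundancy between two variables at a time; a variable that is predictable from a combination of several others can go unnoticed.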

I want to remove any redundant variables from the analysis (ones that aren't really contributing anything) before I carry out a cluster analysis on my dataset. I carried out PCA and found that 3 principal components account for about 90% of the variance of my data. My question pertains to how I interpret this output: Do I need to identify which variables were included in these three principal components, and then remove the variables that were not included? Is this even the correct approach to identifying variables that are not contributing any information to the dataset?
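To make the question concrete, here is a minimal sketch of PCA computed via the SVD of the standardized data. The data are synthetic (three hypothetical latent factors driving 20 variables) and the function name `pca` is an assumption for illustration. The point it shows is that every principal component is a weighted combination of all 20 variables, so no variable is strictly "included" or "not included" in a component; what can be inspected is the size of each loading.

```python
import numpy as np

def pca(X):
    """PCA on standardized columns; returns explained-variance ratios, loadings, scores."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # z-score each variable
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)  # share of total variance per component
    loadings = Vt.T                  # column k: weight of every variable on PC k
    scores = Z @ loadings            # observations expressed in PC coordinates
    return explained, loadings, scores

# Synthetic stand-in: 20 observed variables driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(300, 3))
weights = rng.normal(size=(3, 20))
X = factors @ weights + 0.3 * rng.normal(size=(300, 20))

explained, loadings, scores = pca(X)
cumulative = np.cumsum(explained)      # cumulative[2] is the share held by 3 PCs
pc1_weights = np.abs(loadings[:, 0])   # one weight per original variable
```

Reading `pc1_weights` shows which variables dominate the first component, but a small loading on the leading components does not by itself show that a variable is redundant.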

For context: What I ultimately want to do is to reduce the number of variables required to describe groups of organisms, which will allow me to model other organisms that share these physical descriptors (but for which there are no data collected).


Edit

Thanks for the advice everyone. @hssay: you have highlighted an issue I was wondering about: Whether to carry out the cluster analysis on the original data or on my PCA output. The fact that the derived new variables lack interpretability certainly gives me pause to reconsider my approach. Thank you for clarifying.

If I were to carry out the cluster analysis on the derived variables, is it possible to extract/identify the original variables post-clustering, or are they lost? E.g., if I were to carry out the cluster analysis on the principal components, the clusters themselves would no longer have any real meaning regarding the other organisms I referred to (i.e. the ones that displayed certain physical characteristics from the original dataset, but for which the measurements didn't exist). The reason being that the clusters would be made up of derived variables, not real-world physical descriptor variables. Is that correct?
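One way to look at this: the rows keep their order through PCA, so cluster labels found on the component scores still index the original observations, and each cluster can be summarized in the original variables. Below is a minimal sketch under that assumption, using synthetic data and a bare-bones Lloyd's-style k-means; the helper names `pc_scores` and `kmeans` are hypothetical and the k-means is not a tuned implementation.

```python
import numpy as np

def pc_scores(X, n_components):
    """Project standardized data onto the first n_components principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T

def kmeans(points, k, n_iter=50, seed=0):
    """Bare-bones Lloyd's iterations; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)        # assign each row to its nearest center
        for c in range(k):
            if np.any(labels == c):          # leave an empty cluster's center as is
                centers[c] = points[labels == c].mean(axis=0)
    return labels

# Synthetic stand-in: two groups that differ on the first five variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
X[150:, :5] += 5.0

scores = pc_scores(X, n_components=3)   # cluster in PC space ...
labels = kmeans(scores, k=2)
# ... then describe each cluster in the original measurement units.
profile = {int(c): X[labels == c].mean(axis=0) for c in np.unique(labels)}
```

Each entry of `profile` is a vector of per-cluster means over the original 20 variables, so the clusters can still be described in physical terms even though they were found on derived variables.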

@Paul Siegel: Thanks for the words of warning. My data are not categorical, but I take your point. I will look at the other approaches you suggested.

@Frank Harrell: I don't use R, only MATLAB, and would like to keep my code to just one language… I will certainly look at the function code/reading you suggest, though.

@DJohnson. Thanks! I'll give your methodology a go.

Best Answer

Also consider sparse principal component analysis and redundancy analysis. The latter is implemented in the redun function of the R Hmisc package and involves attempting to predict each predictor from all the other predictors. It handles the "wings" issue discussed above.
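The idea of predicting each predictor from all the others can be sketched outside R as well. What follows is a simplified, linear-only illustration on synthetic data; it is not the Hmisc implementation, and the function name `r2_from_others` is a hypothetical helper. A variable whose R² is close to 1 carries little information beyond what the remaining variables already provide.

```python
import numpy as np

def r2_from_others(X):
    """R^2 of each column regressed linearly (with intercept) on all other columns."""
    n, p = X.shape
    r2 = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)  # ordinary least squares
        resid = y - others @ coef
        r2[j] = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return r2

# Synthetic stand-in: column 5 is almost a linear combination of columns 0, 1 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
X[:, 5] = X[:, 0] + X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=300)

r2 = r2_from_others(X)
```

In this toy example, columns 0, 1, 2 and 5 all show high R² because any one of them can be recovered from the rest. That is why redundant variables are best removed one at a time, recomputing the R² values after each removal, rather than dropping every flagged variable at once.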