Solved – use PCA to do variable selection for cluster analysis

Tags: clustering, factor-analysis, feature-selection, pca

I have to reduce the number of variables to conduct a cluster analysis. My variables are strongly correlated, so I thought of doing a Factor Analysis PCA (principal component analysis). However, if I use the resulting scores, my clusters are not quite correct (compared to previous classifications in the literature).

Question:

Can I use the rotation matrix to select the variables with the largest loadings on each component/factor and use only these variables for my clustering?

Any bibliographic references would also be helpful.

Update:

Some clarifications:

  • My goal:
    I have to run a cluster analysis with the two-step algorithm in SPSS, but my variables are not independent, so I thought about discarding some of them.

  • My dataset:
    I am working with 15 scalar parameters (my variables) measured on 100,000 cases. Some variables are strongly correlated (Pearson $r > 0.9$).

  • My doubt:
    Since I need only independent variables, I thought of running a Principal Component Analysis (sorry: I wrongly talked about Factor Analysis in my original question, my mistake) and selecting only the variables with the largest loadings on each component. I know that the PCA process involves some arbitrary steps, but I found out that this selection is actually similar to the "method B4" proposed by I.T. Jolliffe (1972 & 2002) for selecting variables, and suggested also by J.R. King & D.A. Jackson in 1999.

    So I was thinking of selecting some sub-groups of independent variables in this way. I will then use the groups to run different cluster analyses and compare the results.
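For illustration, the loading-based selection described above can be sketched in numpy. This is a minimal B4-style sketch, not SPSS output: the toy data, the 0.7 eigenvalue cutoff (one of Jolliffe's suggested thresholds; other choices are equally defensible), and the tie-breaking rule are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
base = rng.normal(size=(n, 3))
# Toy data: 6 variables, two strongly correlated pairs (stand-in for the 15 real parameters)
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.05 * rng.normal(size=n),
    base[:, 1], base[:, 1] + 0.05 * rng.normal(size=n),
    base[:, 2], rng.normal(size=n),
])

# PCA on the correlation matrix (i.e., on standardized variables)
corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Retain components with eigenvalue > 0.7 (Jolliffe's cutoff), then pick,
# for each retained component, the variable with the largest absolute
# loading that has not been chosen yet (B4-style selection)
k = int(np.sum(eigvals > 0.7))
selected = []
for j in range(k):
    loadings = np.abs(eigvecs[:, j])
    for idx in np.argsort(loadings)[::-1]:
        if int(idx) not in selected:
            selected.append(int(idx))
            break

print("retained components:", k)
print("selected variables:", sorted(selected))
```

With this toy data the procedure keeps one variable from each highly correlated pair plus the two independent ones, which is exactly the "sub-groups of independent variables" idea.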

Best Answer

I will, as is my custom, take a step back and ask what it is you are trying to do, exactly. Factor analysis is designed to find latent variables. If you want to find latent variables and cluster them, then what you are doing is correct. But you say you simply want to reduce the number of variables - that suggests principal component analysis, instead.

However, with either of those, you have to interpret the cluster analysis in terms of the new variables, and those new variables are simply weighted sums of the old ones.
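The "weighted sums" point is easy to verify numerically; here is a tiny numpy sketch (the data and the use of the correlation-matrix eigenvector as the weight vector are just illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized variables

# First principal component: eigenvector of the correlation matrix
# with the largest eigenvalue
vals, vecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
w = vecs[:, np.argmax(vals)]   # weights of the first component

pc1 = Z @ w                    # each score is a weighted sum of the variables
# Check one case by hand: identical to the explicit weighted sum
manual = sum(w[j] * Z[0, j] for j in range(4))
print(np.isclose(pc1[0], manual))
```

So when you cluster on component scores, each "variable" entering the clustering is such a weighted sum, and interpreting the clusters means interpreting those weights.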

How many variables have you got? How correlated are they? If there are far too many, and they are very strongly correlated, then you could look for all correlations above some very high threshold and randomly delete one variable from each such pair. This reduces the number of variables and leaves the remaining variables as they are.
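That pruning step can be sketched as follows; `drop_one_of_each_correlated_pair` and the 0.9 threshold are hypothetical choices for the example, not part of any standard API:

```python
import numpy as np

def drop_one_of_each_correlated_pair(X, names, threshold=0.9, seed=0):
    """Randomly drop one variable from each pair with |r| above the threshold."""
    rng = np.random.default_rng(seed)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    dropped = set()
    p = corr.shape[0]
    for i in range(p):
        for j in range(i + 1, p):
            if corr[i, j] > threshold and i not in dropped and j not in dropped:
                dropped.add(int(rng.choice([i, j])))  # keep one of the pair at random
    keep = [k for k in range(p) if k not in dropped]
    return X[:, keep], [names[k] for k in keep]

# Toy data: "b" nearly duplicates "a"; "c" is independent
rng = np.random.default_rng(2)
a = rng.normal(size=500)
X = np.column_stack([a, a + 0.01 * rng.normal(size=500), rng.normal(size=500)])
Xr, kept = drop_one_of_each_correlated_pair(X, ["a", "b", "c"])
print(kept)   # one of "a"/"b" removed, "c" kept
```

Unlike PCA, the surviving columns are the original measurements, so the cluster analysis stays directly interpretable.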

Let me also echo @StasK about the need to do this at all, and @rolando2 about the usefulness of finding something different from what has been found before. As my favorite professor in grad school used to say "If you're not surprised, you haven't learned anything".