Solved – When plotting clustering results in the PCA coordinates, does one do PCA or clustering first?

clustering, pca, r

Recently I came across the use of a cluster plot, which combines k-means clustering with PCA. The plot shows the different clusters plotted on the first two PCs. I have checked some of the threads (here and here) regarding its usage.

I want to know: when generating a cluster plot, is the data clustered first and PCA done afterwards, or is it the reverse (PCA followed by k-means clustering)?

I ask because the second link says PCA is done first, followed by clustering. But in the first link, which shows an example of generating a cluster plot, the data is clustered first and then the cluster plot is generated.

Regarding interpretation, is the plot to be read only as showing the number of clusters generated, or are there other points to interpret?

Best Answer

It is hard to see how you could do PCA on clusters; it is quite common to do PCA prior to clustering, particularly when you have a lot of variables. You can then use the PCs as variables.

You might be confusing two different alternatives:

1) Do PCA on the data, then do k-means on the PCs, then plot the results.

2) Do k-means on the data, do PCA on the data, then plot the clusters in terms of means on the PCs.

Both of these seem reasonable to me; the first may be better when there are many variables or when k-means on the data doesn't yield anything useful.

The former condenses the data in order to do cluster analysis. The latter condenses the data in order to visualize cluster analysis.
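
To make the difference between the two alternatives concrete, here is a minimal sketch in Python using only NumPy. Everything in it is an assumption for illustration: the helper names `pca_scores` and `kmeans` are hypothetical, the k-means is a bare-bones Lloyd's algorithm standing in for whatever library routine you would actually use, and the data are synthetic. It is not the code behind any particular cluster-plot function.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project centered data onto its first principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(X, k, n_iter=50, seed=0):
    """Bare-bones Lloyd's algorithm; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if empty).
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels

# Hypothetical toy data: 3 groups of 50 points in 10 dimensions.
rng = np.random.default_rng(42)
true_centers = rng.normal(scale=10.0, size=(3, 10))
X = np.vstack([c + rng.normal(size=(50, 10)) for c in true_centers])

# PC coordinates of every observation; both alternatives plot on these axes.
scores = pca_scores(X, n_components=2)

# Alternative 1: PCA first, then k-means on the PC scores.
labels_pca_first = kmeans(scores, k=3)

# Alternative 2: k-means on the original variables, PCA only for display.
labels_data_first = kmeans(X, k=3)

# Cluster means expressed on the PCs (what alternative 2 plots).
cluster_means_on_pcs = np.array(
    [scores[labels_data_first == j].mean(axis=0) for j in range(3)]
)
```

Either label vector can be used to colour a scatter plot of `scores`. The plots share the same axes, so any difference between them comes only from which variables the clustering saw: the two retained PCs in the first case, all ten original variables in the second.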
