Solved – PCA before cluster analysis

clustering, pca

I am trying to use PCA to reduce the number of variables in my data before performing a cluster analysis. Suppose I extract 3 principal components P1, P2 and P3. When I do the clustering, on which variables should I run my analysis? I am not clear whether I should use all the initial variables (in which case, how does PCA help?) or the 3 extracted components. A detailed answer with an example would be very helpful.

Best Answer

How many features are in your original data? If the number is not too large (say, not in the thousands), many clustering algorithms can work directly on your original data.
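As a minimal sketch of clustering directly on the original features: the choice of k-means, the 3 clusters, and the fixed seed are my assumptions for illustration, not something the question specifies.

```r
# Cluster on the original numeric columns of iris, with no PCA step.
# Assumptions: k-means as the algorithm, centers = 3, and a fixed seed
# so the random starts are reproducible.
set.seed(42)
X <- iris[, 1:3]

# nstart = 20 reruns k-means from 20 random starts and keeps the best fit
km_raw <- kmeans(X, centers = 3, nstart = 20)

# Number of observations assigned to each cluster
table(km_raw$cluster)
```

With only 3 features this runs instantly, so PCA is not needed here for computational reasons.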

PCA loses information once you drop components. If you do not want to lose too much, keep as many PCs as you can (assuming you can afford the computational cost and the curse of dimensionality is not a problem).

If you want to check how much information you lose, see my answer to the following post, which shows how to compute how much information (variance) PCA preserves:

How to calculate how much variance a set of regressors explains on another data set using PCA transformation?
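For a quick check on the same data, one common way (a sketch, not necessarily the method used in the linked answer) is to compute the share of total variance carried by each component from the `sdev` values that `prcomp` returns:

```r
# PCA on the first three numeric columns of iris
pca_out <- prcomp(iris[, 1:3])

# Variance of each component is the squared standard deviation;
# dividing by the total gives the proportion of variance explained.
var_explained <- pca_out$sdev^2 / sum(pca_out$sdev^2)

# Cumulative proportion: how much variance the first k components keep
cum_var <- cumsum(var_explained)
round(cum_var, 3)

# summary() reports the same quantities in its importance table
summary(pca_out)$importance
```

The last entry of `cum_var` is 1, since keeping every component keeps all of the variance; the earlier entries tell you what you give up by truncating.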


To your comment:

If you really want to use PCA, you can run the clustering algorithm on the transformed data. In R, with the toy iris data, the transformed data is pca_out$x:

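# PCA on the first three numeric columns of iris
pca_out <- prcomp(iris[, 1:3])
# Principal component scores (the transformed data), first 20 rows
head(pca_out$x, 20)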
                   PC1          PC2           PC3
      [1,] -2.49088018 -0.320973364 -0.0339745251
      [2,] -2.52334286  0.178400622 -0.2329011355
      [3,] -2.71114888  0.137820058 -0.0025055723
      [4,] -2.55775595  0.315675226  0.0670512306
      [5,] -2.53896432 -0.331356903  0.0986154338
      [6,] -2.13542015 -0.750523350  0.1367151904
      [7,] -2.67669609  0.072944140  0.2311696738
      [8,] -2.42912498 -0.162931683  0.0007979233
      [9,] -2.70915877  0.572318127  0.0322430634
     [10,] -2.44080592  0.123908243 -0.1318158483
     [11,] -2.30049402 -0.641538592 -0.0654553841
     [12,] -2.41545393 -0.015273540  0.1681603305
     [13,] -2.56232620  0.242322950 -0.1666121092
     [14,] -3.03215612  0.502494126  0.0604799584
     [15,] -2.44677625 -1.179585963 -0.2360617554
     [16,] -2.24724960 -1.353446638  0.1997840653
     [17,] -2.50197109 -0.829777299 -0.0024222281
     [18,] -2.49088018 -0.320973364 -0.0339745251
     [19,] -2.00936932 -0.867984466 -0.1284528211
     [20,] -2.42654485 -0.524077475  0.1997126274

Note that I am showing only the first 20 data points after the transformation. You can use all 3 transformed features without information loss, or you can use only the first 2 columns, in which case your data becomes 2-dimensional but loses some information.
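To make both options concrete, here is a minimal sketch. The use of k-means, the 3 clusters, and the seed are assumptions chosen for illustration; any clustering algorithm that takes a numeric matrix could be swapped in.

```r
# PCA scores for the first three numeric columns of iris
pca_out <- prcomp(iris[, 1:3])
scores <- pca_out$x

# Option 1: keep all 3 PCs. prcomp only centers and rotates the data,
# so pairwise Euclidean distances should match the original ones
# up to floating-point tolerance.
all.equal(as.vector(dist(iris[, 1:3])), as.vector(dist(scores)))

# Option 2: keep only the first 2 PCs and cluster in 2 dimensions.
# Assumptions: k-means, centers = 3, fixed seed for reproducibility.
set.seed(42)
km_pc2 <- kmeans(scores[, 1:2], centers = 3, nstart = 20)

# Number of observations assigned to each cluster
table(km_pc2$cluster)
```

Because distances are unchanged when all components are kept, a distance-based method such as k-means has the same information to work with as on the original variables; the reduction only matters once columns are dropped.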