Solved – How would PCA help with a k-means clustering analysis

clusteringdimensionality reductionk-meanspca

Background: I want to classify the residential areas of a city into groups based on their social-economic characteristics, including housing unit density, population density, green space area, housing price, number of schools / health centers / day care centers, etc. I want to understand how many different groups the residential areas can be divided into, and what are their unique characteristics. This information could facilitate city planning.

Based on some examples (cf., this blog post: PCA and K-means Clustering of Delta Aircraft), I figured out the way to do the analysis is:

  1. First do PCA analysis.

  2. Determine the number of unique groups (clusters) based on PCA results (e.g., using the "elbow" method, or alternatively, the number of components that explains 80 to 90% of total variance).

  3. After determining the number of clusters, apply k-means clustering to do the classification.

My questions: it seemed the number of PCA components is related to clusters analysis. So is that's true, if, say, we found 5 PCA components explained more than 90% of variation of all features, then we would apply k-means clustering and get 5 clusters. So would the 5 groups exactly corresponding to the 5 components in PCA analysis?

In another words, I guess my question is: What is the connection between the PCA analysis and k-means clustering?

Updates:
Thanks to Emre, xeon, and Kirill's inputs. So the current answers:

  1. Doing PCA before clustering analysis is also useful for dimensionality reduction as a feature extractor and visualize / reveal clusters.

  2. Doing PCA after clustering can validate the clustering algorithm (reference: Kernel principal component analysis).

  3. PCA is sometimes applied to reduce the dimensionality of the dataset prior to clustering. However, Yeung & Ruzzo (2000) showed that clustering with the PC's instead of the original variables does not necessarily improve cluster quality. In particular, the first few PC's (which contain most of the variation in the data) do not necessarily capture most of the cluster structure.

    • Yeung, Ka Yee, and Walter L. Ruzzo. An empirical study on principal component analysis for clustering gene expression data. Technical report, Department of Computer Science and Engineering, University of Washington, 2000. (pdf)
  4. It seemed PCA is necessary before a two-step clustering analysis. Based on Ibes (2015), in which cluster analysis was run using the factors identified in the PCA.

Best Answer

PCA is not a clustering method. But sometimes it helps to reveal clusters.

Let's assume you have 10-dimensional Normal distributions with mean $0_{10}$ (vector of zeros) and some covariance matrix with 3 directions having bigger variance than others. Applying principal component analysis with 3 components will give you these directions in decreasing order and 'elbow' approach will say to you that this amount of chosen components is right. However, it will be still a cloud of points (1 cluster).

Let's assume you have 10 10-dimensional Normal distributions with means $1_{10}$, $2_{10}$, ... $10_{10}$ (means are staying almost on the line) and similar covariance matrices. Applying PCA with only 1 component (after standardization) will give you the direction where you will observe all 10 clusters. Analyzing explained variance ('elbow' approach), you will see that 1 component is enough to describe this data.

In the link you show PCA is used only to build some hypotheses regarding the data. The amount of clusters is determined by 'elbow' approach according to the value of within groups sum of squares (not by explained variance). Basically, you repeat K-means algorithm for different amount of clusters and calculate this sum of squares. If the number of clusters equal to the number of data points, then sum of squares equal $0$.