Solved – the relation between k-means clustering and PCA

Tags: clustering, k-means, pca

It is a common practice to apply PCA (principal component analysis) before a clustering algorithm (such as k-means). It is believed to improve the clustering results in practice (noise reduction).
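
Concretely, the pipeline I have in mind looks something like this (a minimal MATLAB sketch; the data, the number of retained components $d$ and the number of clusters $K$ are arbitrary placeholders):

% Placeholder data; in practice X is the N x T dataset of interest
X = randn(500, 50);                 % N = 500 points, T = 50 features
d = 10; K = 3;                      % number of retained PCs and of clusters (arbitrary here)
[coeff, score] = pca(X);            % pca() centers the data internally
idx = kmeans(score(:, 1:d), K);     % cluster in the d-dimensional PC space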

However, I am interested in a comparative and in-depth study of the relationship between PCA and k-means. For example, Chris Ding and Xiaofeng He, 2004, K-means Clustering via Principal Component Analysis showed that "principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering". However, I have a hard time understanding this paper, and Wikipedia actually claims that it is wrong.

Also, the results of the two methods are somewhat different in the sense that PCA helps to reduce the number of "features" while preserving the variance, whereas clustering reduces the number of "data points" by summarizing several points by their expectations/means (in the case of k-means).
So if the dataset consists of $N$ points with $T$ features each, PCA aims at compressing the $T$ features whereas clustering aims at compressing the $N$ data points.
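
To make this asymmetry concrete, here is a toy MATLAB sketch (all sizes are arbitrary): PCA shrinks the number of columns, while k-means shrinks the number of rows.

X = randn(1000, 20);          % N = 1000 points, T = 20 features
[~, score] = pca(X);
Xred = score(:, 1:5);         % still 1000 points, but only 5 "features"
[idx, C] = kmeans(X, 8);      % C is 8 x 20: only 8 "points", still 20 features
size(Xred)                    % 1000 x 5
size(C)                       % 8 x 20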

I am looking for a layman's explanation of the relation between these two techniques, plus some more technical papers relating them.

Best Answer

It is true that K-means clustering and PCA appear to have very different goals and at first sight do not seem to be related. However, as explained in the Ding & He 2004 paper K-means Clustering via Principal Component Analysis, there is a deep connection between them.

The intuition is that PCA seeks to represent all $n$ data vectors as linear combinations of a small number of eigenvectors, and does so to minimize the mean-squared reconstruction error. In contrast, K-means seeks to represent all $n$ data vectors via a small number of cluster centroids, i.e. to represent them as linear combinations of a small number of cluster centroid vectors where all linear combination weights are zero except for a single $1$. This is also done to minimize the mean-squared reconstruction error.

So K-means can be seen as a super-sparse PCA.
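
Here is this "reconstruction" view in code (a MATLAB sketch on random data, just to illustrate the analogy; it is not taken from the paper):

X  = randn(500, 10);
Xc = bsxfun(@minus, X, mean(X));               % centered data
[U, S, V] = svd(Xc, 0);
Xpca = Xc * V(:,1) * V(:,1)';                  % rank-1 PCA reconstruction
[idx, C] = kmeans(Xc, 2, 'Replicates', 10);
Xkm  = C(idx, :);                              % each point replaced by its cluster centroid
mean(sum((Xc - Xpca).^2, 2))                   % PCA mean squared reconstruction error
mean(sum((Xc - Xkm).^2, 2))                    % K-means mean squared reconstruction error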

The Ding & He paper makes this connection more precise.


Unfortunately, the Ding & He paper contains some sloppy formulations (at best) and can easily be misunderstood. E.g. it might seem that Ding & He claim to have proved that the cluster centroids of the K-means clustering solution lie in the $(K-1)$-dimensional PCA subspace:

Theorem 3.3. Cluster centroid subspace is spanned by the first $K-1$ principal directions [...].

For $K=2$ this would imply that the projections on the PC1 axis will necessarily be negative for one cluster and positive for the other cluster, i.e. the PC2 axis will separate the clusters perfectly.

This is either a mistake or some sloppy writing; in any case, taken literally, this particular claim is false.

Let's start by looking at some toy examples in 2D for $K=2$. I generated some samples from two normal distributions with the same covariance matrix but varying means. I then ran both K-means and PCA. The following figure shows scatter plots of the data in the top row, and the same data colored according to the K-means solution in the bottom row. I also show the first principal direction as a black line and the class centroids found by K-means as black crosses. The PC2 axis is shown with a dashed black line. K-means was repeated $100$ times with random seeds to ensure convergence to the global optimum.

PCA vs K-means

One can clearly see that even though the class centroids tend to be pretty close to the first PC direction, they do not fall on it exactly. Moreover, even though the PC2 axis separates the clusters perfectly in subplots 1 and 4, there are a couple of points on the wrong side of it in subplots 2 and 3.

So the agreement between K-means and PCA is quite good, but it is not exact.

So what did Ding & He prove? For simplicity, I will consider only the $K=2$ case. Let the number of points assigned to each cluster be $n_1$ and $n_2$ and the total number of points $n=n_1+n_2$. Following Ding & He, let's define the cluster indicator vector $\mathbf q\in\mathbb R^n$ as follows: $q_i = \sqrt{n_2/(n n_1)}$ if the $i$-th point belongs to cluster 1 and $q_i = -\sqrt{n_1/(n n_2)}$ if it belongs to cluster 2. The cluster indicator vector has unit length $\|\mathbf q\| = 1$ and is "centered", i.e. its elements sum to zero, $\sum_i q_i = 0$.
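
In code, the indicator vector looks like this (a sketch; idx is any vector of 1/2 cluster labels):

idx = [ones(30,1); 2*ones(70,1)];        % some assignment of n = 100 points into two clusters
n1  = sum(idx == 1); n2 = sum(idx == 2); n = n1 + n2;
q   = zeros(n, 1);
q(idx == 1) =  sqrt(n2 / (n * n1));
q(idx == 2) = -sqrt(n1 / (n * n2));
norm(q)                                  % 1: unit length
sum(q)                                   % 0: centered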

Ding & He show that the K-means loss function $\sum_k \sum_i \|\mathbf x_i^{(k)} - \boldsymbol \mu_k\|^2$ (that the K-means algorithm minimizes), where $\mathbf x_i^{(k)}$ is the $i$-th point in cluster $k$, can be equivalently rewritten, up to an additive constant that does not depend on the cluster assignment, as $-\mathbf q^\top \mathbf G \mathbf q$, where $\mathbf G$ is the $n\times n$ Gram matrix of scalar products between all points: $\mathbf G = \mathbf X_c \mathbf X_c^\top$, where $\mathbf X$ is the $n\times 2$ data matrix and $\mathbf X_c$ is the centered data matrix.

(Note: I am using notation and terminology that slightly differ from their paper but that I find clearer.)

So the K-means solution $\mathbf q$ is a centered unit vector maximizing $\mathbf q^\top \mathbf G \mathbf q$. It is easy to show that the first principal component (meaning the $n$-vector of PC1 scores, normalized to have unit sum of squares) is the leading eigenvector of the Gram matrix, i.e. it is also a centered unit vector $\mathbf p$ maximizing $\mathbf p^\top \mathbf G \mathbf p$. The only difference is that $\mathbf q$ is additionally constrained to have only two different values whereas $\mathbf p$ does not have this constraint.
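
Both statements are easy to check numerically (a MATLAB sketch of my own, not the paper's code; the additive constant mentioned above is $\operatorname{tr}\mathbf G$, the total sum of squares):

rng(1)
X  = randn(200, 2);
Xc = bsxfun(@minus, X, mean(X));                 % centered data
G  = Xc * Xc';                                   % Gram matrix

[idx, C] = kmeans(Xc, 2, 'Replicates', 50);
n1 = sum(idx == 1); n2 = sum(idx == 2); n = n1 + n2;
q  = zeros(n, 1);
q(idx == 1) =  sqrt(n2 / (n * n1));
q(idx == 2) = -sqrt(n1 / (n * n2));

loss = sum(sum((Xc - C(idx,:)).^2));             % K-means loss of the converged solution
[loss, trace(G) - q' * G * q]                    % these two values coincide

[U, S, V] = svd(Xc, 0);
p = U(:,1);                                      % unit-norm vector of PC1 scores
[p' * G * p, q' * G * q]                         % the PCA value is >= the K-means value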

In other words, K-means and PCA maximize the same objective function, with the only difference being that K-means has an additional "categorical" constraint.

It stands to reason that most of the time the K-means (constrained) and PCA (unconstrained) solutions will be pretty close to each other, as we saw above in the simulation, but one should not expect them to be identical. Taking $\mathbf p$ and setting all its negative elements to be equal to $-\sqrt{n_1/(n n_2)}$ and all its positive elements to $\sqrt{n_2/(n n_1)}$ will generally not give exactly $\mathbf q$.
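
One way to see this numerically (again just a sketch): threshold the PC1 scores at zero, turn the two signs into cluster labels, and compare them with the K-means labels. They usually agree on most, but not necessarily all, points:

rng(2)
X  = [randn(100,2); bsxfun(@plus, [3 0], randn(100,2))];    % two overlapping Gaussian clusters
Xc = bsxfun(@minus, X, mean(X));
[U, S, V] = svd(Xc, 0);
pcaLabels = 1 + (U(:,1) > 0);                               % split the PC1 scores at zero
kmLabels  = kmeans(Xc, 2, 'Replicates', 50);
m = sum(pcaLabels ~= kmLabels);
min(m, 200 - m)                                             % number of disagreements, up to a label swap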

Ding & He seem to understand this well because they formulate their theorem as follows:

Theorem 2.2. For K-means clustering where $K= 2$, the continuous solution of the cluster indicator vector is the [first] principal component

Note the words "continuous solution". After proving this theorem they additionally comment that PCA can be used to initialize K-means iterations, which makes total sense given that we expect $\mathbf q$ to be close to $\mathbf p$. But one still needs to perform the iterations, because they are not identical.
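
For what it's worth, such a PCA-based initialization could look like this (one possible way to implement their suggestion, not the authors' own procedure):

rng(3)
X  = [randn(100,2); bsxfun(@plus, [3 0], randn(100,2))];
Xc = bsxfun(@minus, X, mean(X));
[U, S, V] = svd(Xc, 0);
split = U(:,1) > 0;                                    % initial partition from the sign of PC1
C0 = [mean(Xc(split,:)); mean(Xc(~split,:))];          % centroids of the two halves
idx = kmeans(Xc, 2, 'Start', C0);                      % a few K-means iterations refine this split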

However, Ding & He then go on to develop a more general treatment for $K>2$ and end up formulating Theorem 3.3 as

Theorem 3.3. Cluster centroid subspace is spanned by the first $K-1$ principal directions [...].

I did not go through the math of Section 3, but I believe that this theorem in fact also refers to the "continuous solution" of K-means, i.e. its statement should read "cluster centroid space of the continuous solution of K-means is spanned [...]".

Ding & He, however, do not make this important qualification, and moreover write in their abstract that

Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. Equivalently, we show that the subspace spanned by the cluster centroids are given by spectral expansion of the data covariance matrix truncated at $K-1$ terms.

The first sentence is absolutely correct, but the second one is not. It is not clear to me if this is (very) sloppy writing or a genuine mistake. I have very politely emailed both authors asking for clarification. (Update two months later: I have never heard back from them.)


Matlab simulation code

figure('Position', [100 100 1200 600])

n = 50;                     % points per cluster
Sigma = [2 1.8; 1.8 2];     % shared covariance matrix of both clusters

for i=1:4
    means = [0 0; i*2 0];   % increase the separation between the two cluster means
    
    % sample n points from each Gaussian, then center the pooled data
    rng(42)
    X = [bsxfun(@plus, means(1,:), randn(n,2) * chol(Sigma)); ...
         bsxfun(@plus, means(2,:), randn(n,2) * chol(Sigma))];
    X = bsxfun(@minus, X, mean(X));
    [U,S,V] = svd(X,0);                                 % PCA via SVD of the centered data
    [ind, centroids] = kmeans(X,2, 'Replicates', 100);  % K-means with 100 random restarts
    
    subplot(2,4,i)                          % top row: raw data
    scatter(X(:,1), X(:,2), [], [0 0 0])

    subplot(2,4,i+4)                        % bottom row: data colored by the K-means solution
    hold on
    scatter(X(ind==1,1), X(ind==1,2), [], [1 0 0])
    scatter(X(ind==2,1), X(ind==2,2), [], [0 0 1])
    plot([-1 1]*10*V(1,1), [-1 1]*10*V(2,1), 'k', 'LineWidth', 2)   % PC1 axis (solid line)
    % class centroids as black-on-white crosses
    plot(centroids(1,1), centroids(1,2), 'w+', 'MarkerSize', 15, 'LineWidth', 4)
    plot(centroids(1,1), centroids(1,2), 'k+', 'MarkerSize', 10, 'LineWidth', 2)
    plot(centroids(2,1), centroids(2,2), 'w+', 'MarkerSize', 15, 'LineWidth', 4)
    plot(centroids(2,1), centroids(2,2), 'k+', 'MarkerSize', 10, 'LineWidth', 2)
    
    plot([-1 1]*5*V(1,2), [-1 1]*5*V(2,2), 'k--')                   % PC2 axis (dashed line)
end

for i=1:8
    subplot(2,4,i)
    axis([-8 8 -8 8])
    axis square
    set(gca,'xtick',[],'ytick',[])
end