# Solved – What does total SS and between SS mean in k-means clustering

I'm very new to cluster analysis. I'm using R for k-means clustering, and I wonder what these quantities are. Is it better if their ratio is smaller or larger?


#### Related Solutions

I believe that if there is a significant difference in your dependent variable between your clusters, then the approach of clustering first will definitely be helpful, regardless of your chosen learning algorithm.

In my opinion, running a learning algorithm on the whole base can cover up meaningful differences at a lower level of aggregation.

Consider Simpson's paradox: it is an extreme case of a deeper problem, where different correlations in different groups are covered up by the noise, or by weaker correlations, of the larger pooled sample.

It is true that K-means clustering and PCA appear to have very different goals and at first sight do not seem to be related. However, as explained in the Ding & He 2004 paper K-means Clustering via Principal Component Analysis, there is a deep connection between them.

The intuition is that PCA seeks to represent all $n$ data vectors as linear combinations of a small number of eigenvectors, and does so to minimize the mean-squared reconstruction error. In contrast, K-means seeks to represent all $n$ data vectors via a small number of cluster centroids, i.e. to represent them as linear combinations of a small number of cluster centroid vectors where the linear combination weights must all be zero except for a single $1$. This is also done to minimize the mean-squared reconstruction error.

**So K-means can be seen as a super-sparse PCA.**
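To make this concrete, here is a small numerical sketch of my own (in Python rather than this thread's R/Matlab, purely for illustration): for $K=2$, the centered cluster centroids lie on a line through the origin, so replacing each point by its centroid is a *constrained* rank-1 reconstruction, and its error can never beat the unconstrained rank-1 PCA reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
# two Gaussian blobs, then center the data
Xc = np.vstack([rng.normal([0, 0], 1, (50, 2)),
                rng.normal([4, 0], 1, (50, 2))])
Xc -= Xc.mean(0)

# best rank-1 (PCA) reconstruction via SVD
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = np.outer(Xc @ Vt[0], Vt[0])
err_pca = ((Xc - X_pca)**2).mean()

# 2-means reconstruction: replace each point by its cluster centroid
C = Xc[[0, 50]]                  # seed with one point from each blob
for _ in range(100):             # plain Lloyd iterations
    labels = np.argmin(((Xc[:, None] - C[None])**2).sum(-1), axis=1)
    C = np.array([Xc[labels == k].mean(0) for k in (0, 1)])
err_km = ((Xc - C[labels])**2).mean()

print(err_pca <= err_km)  # True: the "super-sparse" reconstruction cannot win
```

The two errors are typically close for well-separated blobs, which is the whole point of the analogy.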

The Ding & He paper makes this connection more precise.

Unfortunately, the Ding & He paper contains some sloppy formulations (at best) and can easily be misunderstood. E.g. it might seem that Ding & He claim to have proved that the cluster centroids of the K-means clustering solution lie in the $(K-1)$-dimensional PCA subspace:

> Theorem 3.3. Cluster centroid subspace is spanned by the first $K-1$ principal directions [...].

For $K=2$ this would imply that the projections on the PC1 axis will necessarily be negative for one cluster and positive for the other, i.e. the PC2 axis would separate the clusters perfectly.

**This is either a mistake or some sloppy writing; in any case, taken literally, this particular claim is false.**

Let's start by looking at some toy examples in 2D for $K=2$. I generated samples from two normal distributions with the same covariance matrix but varying means, and then ran both K-means and PCA. The following figure (produced by the Matlab simulation code below) shows the scatter plot of the data in the top row, and the same data colored according to the K-means solution in the bottom row. I also show the first principal direction as a solid black line, the PC2 axis as a dashed black line, and the class centroids found by K-means with black crosses. K-means was repeated $100$ times with random seeds to ensure convergence to the global optimum.

One can clearly see that even though the class centroids tend to be pretty close to the first PC direction, they do **not** fall on it exactly. Moreover, even though the PC2 axis separates the clusters perfectly in subplots 1 and 4, there are a couple of points on the wrong side of it in subplots 2 and 3.

**So the agreement between K-means and PCA is quite good, but it is not exact.**

So what did Ding & He prove? For simplicity, I will consider only the $K=2$ case. Let the number of points assigned to each cluster be $n_1$ and $n_2$, and the total number of points $n=n_1+n_2$. Following Ding & He, let's define the *cluster indicator vector* $\mathbf q\in\mathbb R^n$ as follows: $q_i = \sqrt{n_2/(n n_1)}$ if the $i$-th point belongs to cluster 1 and $q_i = -\sqrt{n_1/(n n_2)}$ if it belongs to cluster 2. The cluster indicator vector has unit length $\|\mathbf q\| = 1$ and is "centered", i.e. its elements sum to zero: $\sum_i q_i = 0$.
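These two properties are easy to check numerically (a quick sketch of mine, not code from the paper):

```python
import numpy as np

n1, n2 = 3, 5          # arbitrary cluster sizes
n = n1 + n2
# cluster indicator vector as defined by Ding & He
q = np.concatenate([np.full(n1,  np.sqrt(n2 / (n * n1))),
                    np.full(n2, -np.sqrt(n1 / (n * n2)))])

print(q.sum())   # ~0: elements sum to zero ("centered")
print(q @ q)     # ~1: unit length
```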

Ding & He show that the K-means loss function $\sum_k \sum_i \|\mathbf x_i^{(k)} - \boldsymbol \mu_k\|^2$ (which the K-means algorithm minimizes), where $\mathbf x_i^{(k)}$ is the $i$-th point in cluster $k$ and $\boldsymbol \mu_k$ is the centroid of cluster $k$, can be equivalently rewritten as $\operatorname{tr}(\mathbf G)-\mathbf q^\top \mathbf G \mathbf q$, where $\mathbf G$ is the $n\times n$ Gram matrix of scalar products between all points: $\mathbf G = \mathbf X_c \mathbf X_c^\top$, where $\mathbf X$ is the $n\times 2$ data matrix and $\mathbf X_c$ is the centered data matrix. Since the trace term is a constant, minimizing the loss is equivalent to maximizing $\mathbf q^\top \mathbf G \mathbf q$.

*(Note: I am using notation and terminology that slightly differs from their paper but that I find clearer).*
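Numerically, the K-means loss equals $\operatorname{tr}(\mathbf G)-\mathbf q^\top \mathbf G \mathbf q$ for *any* partition, not just the optimal one; here is a quick check (my own sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
Xc = rng.normal(size=(8, 2))
Xc -= Xc.mean(0)                 # centered data matrix X_c
G = Xc @ Xc.T                    # Gram matrix of scalar products

labels = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # an arbitrary partition
n1, n2 = 3, 5
n = n1 + n2
q = np.where(labels == 0, np.sqrt(n2 / (n * n1)), -np.sqrt(n1 / (n * n2)))

# within-cluster sum of squares (the K-means loss)
mu = np.array([Xc[labels == k].mean(0) for k in (0, 1)])
wss = sum(((Xc[labels == k] - mu[k])**2).sum() for k in (0, 1))

print(np.isclose(wss, np.trace(G) - q @ G @ q))  # True
```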

So the K-means solution $\mathbf q$ is a centered unit vector maximizing $\mathbf q^\top \mathbf G \mathbf q$. It is easy to show that the first principal component (when normalized to have unit sum of squares) is the leading eigenvector of the Gram matrix, i.e. it is also a centered unit vector $\mathbf p$ maximizing $\mathbf p^\top \mathbf G \mathbf p$. The only difference is that $\mathbf q$ is additionally constrained to have only two different values whereas $\mathbf p$ does not have this constraint.
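The claim about the first principal component can likewise be checked numerically (a sketch; note that `numpy.linalg.eigh` returns eigenvalues in ascending order, so the leading eigenvector of the Gram matrix is the last column):

```python
import numpy as np

rng = np.random.default_rng(2)
Xc = rng.normal(size=(10, 3))
Xc -= Xc.mean(0)
G = Xc @ Xc.T                     # Gram matrix

w, V = np.linalg.eigh(G)
p = V[:, -1]                      # leading eigenvector of G

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[0]
scores /= np.linalg.norm(scores)  # PC1 scores, normalized to unit length

# identical up to sign
match = min(np.abs(p - scores).max(), np.abs(p + scores).max())
print(match)  # ~0
```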

**In other words, K-means and PCA maximize the same objective function, with the only difference being that K-means has additional "categorical" constraint.**

It stands to reason that most of the time the K-means (constrained) and PCA (unconstrained) solutions will be pretty close to each other, as we saw above in the simulation, but one should not expect them to be identical. Taking $\mathbf p$ and setting all its negative elements to $-\sqrt{n_1/(n n_2)}$ and all its positive elements to $\sqrt{n_2/(n n_1)}$ will generally **not** give exactly $\mathbf q$.

Ding & He seem to understand this well because they formulate their theorem as follows:

> Theorem 2.2. For K-means clustering where $K= 2$, the continuous solution of the cluster indicator vector is the [first] principal component

Note the words "continuous solution". After proving this theorem, they additionally comment that PCA can be used to initialize K-means iterations, which makes total sense given that we expect $\mathbf q$ to be close to $\mathbf p$. But one still needs to perform the iterations, because the two are not identical.
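This initialization idea is easy to sketch (my own Python illustration, not code from the paper): take the sign of the PC1 scores as the initial labels, then run Lloyd iterations, which can only decrease the within-cluster sum of squares.

```python
import numpy as np

def wss(Xc, labels, C):
    # within-cluster sum of squares for a given partition
    return sum(((Xc[labels == k] - C[k])**2).sum() for k in range(len(C)))

rng = np.random.default_rng(4)
Xc = np.vstack([rng.normal([0, 0], 1, (60, 2)),
                rng.normal([3, 1], 1, (60, 2))])
Xc -= Xc.mean(0)

# initial labels: sign of the projections onto PC1
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
labels = (Xc @ Vt[0] > 0).astype(int)
C = np.array([Xc[labels == k].mean(0) for k in (0, 1)])
start = wss(Xc, labels, C)

# a few Lloyd iterations refine the PCA initialization
for _ in range(20):
    labels = np.argmin(((Xc[:, None] - C[None])**2).sum(-1), axis=1)
    C = np.array([Xc[labels == k].mean(0) for k in (0, 1)])

print(wss(Xc, labels, C) <= start)  # True: iterations never increase the loss
```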

However, Ding & He then go on to develop a more general treatment for $K>2$ and end up formulating Theorem 3.3 as

> Theorem 3.3. Cluster centroid subspace is spanned by the first $K-1$ principal directions [...].

I did not go through the math of Section 3, but I believe that this theorem in fact also refers to the "continuous solution" of K-means, i.e. its statement should read "the cluster centroid subspace of the continuous solution of K-means is spanned [...]".

Ding & He, however, do not make this important qualification, and moreover write in their abstract that

> Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. Equivalently, we show that the subspace spanned by the cluster centroids are given by spectral expansion of the data covariance matrix truncated at $K-1$ terms.

The first sentence is absolutely correct, but the second one is not. **It is not clear to me if this is a (very) sloppy writing or a genuine mistake.** I have very politely emailed both authors asking for clarification. *(Update two months later: I have never heard back from them.)*

## Matlab simulation code

```matlab
figure('Position', [100 100 1200 600])
n = 50;
Sigma = [2 1.8; 1.8 2];
for i=1:4
    % two Gaussian clouds with identical covariance and increasing separation
    means = [0 0; i*2 0];
    rng(42)
    X = [bsxfun(@plus, means(1,:), randn(n,2) * chol(Sigma)); ...
         bsxfun(@plus, means(2,:), randn(n,2) * chol(Sigma))];
    X = bsxfun(@minus, X, mean(X));        % center the data
    [U,S,V] = svd(X,0);                    % PCA via SVD
    [ind, centroids] = kmeans(X, 2, 'Replicates', 100);

    % top row: raw data
    subplot(2,4,i)
    scatter(X(:,1), X(:,2), [], [0 0 0])

    % bottom row: K-means solution, PC1 (solid), PC2 (dashed), centroids (+)
    subplot(2,4,i+4)
    hold on
    scatter(X(ind==1,1), X(ind==1,2), [], [1 0 0])
    scatter(X(ind==2,1), X(ind==2,2), [], [0 0 1])
    plot([-1 1]*10*V(1,1), [-1 1]*10*V(2,1), 'k', 'LineWidth', 2)
    plot(centroids(1,1), centroids(1,2), 'w+', 'MarkerSize', 15, 'LineWidth', 4)
    plot(centroids(1,1), centroids(1,2), 'k+', 'MarkerSize', 10, 'LineWidth', 2)
    plot(centroids(2,1), centroids(2,2), 'w+', 'MarkerSize', 15, 'LineWidth', 4)
    plot(centroids(2,1), centroids(2,2), 'k+', 'MarkerSize', 10, 'LineWidth', 2)
    plot([-1 1]*5*V(1,2), [-1 1]*5*V(2,2), 'k--')
end
for i=1:8
    subplot(2,4,i)
    axis([-8 8 -8 8])
    axis square
    set(gca,'xtick',[],'ytick',[])
end
```


## Best Answer

It's basically a measure of the goodness of the classification k-means has found. SS stands for Sum of Squares, so it's the usual decomposition of deviance into deviance "Between" and deviance "Within". Ideally you want a clustering that has the properties of internal cohesion and external separation, i.e. the BSS/TSS ratio should approach 1.

For example, in R, the printed summary of a `kmeans` fit reports this ratio; a value such as 88.4% (0.884) indicates a good fit. You should be careful though, and it's usually a good idea to plot the WSS against the number of clusters, since this number has to be specified beforehand.
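The original R call for that 88.4% figure was not preserved in the thread. As an illustration of the same decomposition, here is a Python sketch of mine (with made-up data) computing TSS = WSS + BSS and the ratio that R's `kmeans` printout reports as `between_SS / total_SS`:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (40, 2)),
               rng.normal(5, 1, (40, 2))])

# plain Lloyd 2-means, seeded with one point from each blob
C = X[[0, 40]]
for _ in range(50):
    labels = np.argmin(((X[:, None] - C[None])**2).sum(-1), axis=1)
    C = np.array([X[labels == k].mean(0) for k in (0, 1)])

g = X.mean(0)
tss = ((X - g)**2).sum()                                      # total SS
wss = sum(((X[labels == k] - C[k])**2).sum() for k in (0, 1)) # within SS
bss = sum((labels == k).sum() * ((C[k] - g)**2).sum()         # between SS
          for k in (0, 1))

print(round(bss / tss, 3))  # close to 1 for well-separated clusters
```

The identity TSS = WSS + BSS holds exactly whenever the centroids are the actual cluster means, which is what both this sketch and R's `kmeans` produce.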