K-Means Clustering – Number of Clusters and Within-Cluster Sum of Squares

Tags: clustering, k-means, sums-of-squares

Given a collection of observations $\{X_i\}_{i=1}^N$ and a prespecified number of clusters $K$, K-means solves
$$
\underset{\{C_k\}_{k=1}^K}{\arg\min} \sum_{k=1}^K \sum_{i \in C_k} \| X_i - \mu_k \|^2
$$

where $\|\cdot\|$ is the $\ell_2$ norm and $\mu_k$ is the mean of the $X_i$ in $C_k$. In ISL and the R function kmeans(), $\sum_{i \in C_k} \| X_i - \mu_k \|^2$ is the within-cluster sum of squares.
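Spelled out in code, the objective can be computed directly from a partition. This is a minimal pure-Python sketch for 1-D data; the function name `total_wss` and the toy data are my own illustration, not from ISL or kmeans():

```python
def total_wss(points, assignment, k):
    """Total within-cluster sum of squares for 1-D points.

    assignment[i] gives the cluster index (0..k-1) of points[i].
    """
    wss = 0.0
    for j in range(k):
        cluster = [x for x, a in zip(points, assignment) if a == j]
        if not cluster:
            continue  # empty cluster contributes nothing
        mu = sum(cluster) / len(cluster)          # cluster mean
        wss += sum((x - mu) ** 2 for x in cluster)
    return wss

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(total_wss(points, [0, 0, 0, 1, 1, 1], 2))  # → 4.0
```

Each cluster here has mean 2 and 11 respectively, contributing $1+0+1=2$ each, hence the total of 4.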

My questions:

Q1. As the value of $K$ increases, what is the behavior of the total within-cluster sum of squares $\sum_{k=1}^K \sum_{i \in C_k} \| X_i - \mu_k \|^2$?

Q2. (This is an open question.) In order to increase the accuracy of assignment, I attempt to choose a sufficiently large $K$, hoping that each subgroup will contain only (or mostly) observations from a common true group. We could then combine these subgroups to approximate the unknown true clustering. Is this idea feasible?

My thinking:

Q1. I see that k-means is a randomized algorithm and is likely to find a local minimum (discussed here). But is there any theorem (paper, or case study) that explains the (stochastic) behavior of the total within-cluster sum of squares and its upper bound (or other results)?

Take the iris data in R as an example.

Q2. I plot the partition results for $K$ varying from 3 to 12. Different colors (black, blue, and red) represent the true species of each observation, and different symbols represent the clusters found by kmeans. It seems that my idea is meaningful.

K=3

K=8

K=12

Thank you for your opinions!

Best Answer

Preamble: You talk about the "underlying true clusters", but in applied clustering this is a highly problematic concept. Assuming a certain model, one can define what is meant by "true clustering", but more than one definition is possible (for example, a mixture distribution of six Gaussians may have only three modes, and one can define the "true clustering" as corresponding to either the six Gaussians or the three modes, which gives different concepts of what the term "true clustering" actually means). In a real application, what kind of "true clusters" a researcher wants to find will always depend on the nature of the data and the aim of clustering. The data on their own cannot decide this for you! There is more insight in this excerpt from the Handbook of Cluster Analysis: https://arxiv.org/abs/1503.02059

Another thing to know here is that the behaviour of the global optimum of the K-means objective function may differ from the behaviour of local optima that an algorithm actually will find. Unfortunately, unless the data set is very small, no algorithm that runs in realistic time will guarantee that you find the global optimum.

Q1: At the global optimum, the WSS (within-cluster sum of squares) never increases with increasing $K$. This can be proved by noting that if you have a solution for $K-1$ clusters, you can always achieve the same WSS value for $K$ clusters by not assigning any point to the added $K$th cluster and keeping all the other cluster centers as they are. In fact, as long as there are at least $K$ distinct points, one can always strictly decrease the WSS by putting cluster center $K$ exactly on a point where no cluster center was before.

This does not necessarily hold for the local optimum found in a real analysis. However, it is probably safe to say that the WSS will mostly decrease with increasing $K$ even then. It is also possible to construct an algorithm for finding local optima such that the WSS cannot increase when increasing $K$, but this may not be the best algorithm when judged by other performance criteria. Furthermore, if $K=N$ (in fact, if $K$ equals the number of distinct points in the data), the WSS will be 0, its lowest possible value.
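As a quick illustration of this behaviour, here is a pure-Python sketch on a toy 1-D data set (not the iris data): Lloyd's algorithm with many random restarts, which in practice approximates the global optimum on such a small set, yields a WSS that decreases in $K$ and hits 0 when $K$ equals the number of distinct points. The function names and the data are my own invention for the demo:

```python
import random

def lloyd(points, centers, iters=100):
    """Lloyd's algorithm for 1-D data; returns the final centers."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for x in points:
            j = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
            clusters[j].append(x)
        # Update step: each center moves to its cluster mean
        # (an empty cluster keeps its old center).
        new = [sum(c) / len(c) if c else centers[j]
               for j, c in enumerate(clusters)]
        if new == centers:  # converged
            break
        centers = new
    return centers

def best_wss(points, k, restarts=50, seed=0):
    """Best WSS over many random restarts (approximates the global optimum)."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(restarts):
        centers = lloyd(points, rng.sample(points, k))
        wss = sum(min((x - c) ** 2 for c in centers) for x in points)
        best = min(best, wss)
    return best

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]  # three groups
wss = {k: best_wss(points, k) for k in range(1, 10)}
```

With $K=3$ the best solution puts one center at each group mean (WSS $= 6$), and with $K=9$ every point is its own center (WSS $= 0$); in between, the best WSS found never increases as $K$ grows, exactly as the global-optimum argument predicts.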

Q2: The preamble implies that any answer to this question requires the nontrivial and non-unique definition of what you actually mean by "true clusters", and in any real application, neither will the data tell you this, nor is there any guarantee that any formal model-based definition will correspond to what is really meaningful. No real progress (as in: "really meaningful for real data") can be made here without knowledge of what the data are, the aim of clustering, and potentially other background of the real application.

That said, your idea is not stupid, and there is some literature about it:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5935272/

https://link.springer.com/article/10.1007%2Fs00357-019-09314-8

Actually, the traditional Ward's hierarchical clustering method can be interpreted as doing a very similar thing (at each stage two clusters are combined by use of the WSS criterion). However, as said before, there is no way to determine from the data alone without additional user input how much "combining" there should be and when to stop.
