Solved – Elbow criteria to determine number of cluster

clusteringk-means

It is mentioned here that one of the methods to determine the optimal number of clusters in a data-set is the "elbow method". Here the percentage of variance is calculated as the ratio of the between-group variance to the total variance.

I felt difficult in understanding this calculation. Can any one explain how to calculate the percentage of variance for a data-set represented as feature matrix $F \in \mathbf{R}^{m \times n}$, where $m$ is the feature dimension and $n$ is the number of data points. I use the k-means algorithm for clustering.

Best Answer

The idea underlying the k-means algorithm is to try to find clusters that minimize the within-cluster variance (or up to a constant the corresponding sum of squares or SS), which amounts to maximize the between-cluster SS because the total variance is fixed. As mentioned on the wiki, you can directly use the within SS and look at its variation when increasing the number of clusters (like we would do in Factor Analysis with a screeplot): an abrupt change in how SS evolve is suggestive of an optimal solution, although this merely stands from visual appreciation. As the total variance is fixed, it is equivalent to study how the ratio of the between and total SS, also called the percentage of variance explained, evolves, because in this the case it will present a large gap from one k to the next k+1. (Note that the between/within ratio is not distributed as an F-distribution because k is not fixed; so, test are meaningless.)

In sum, you just have to compute the squared distance between each data point and their respective center (or centroid), for each cluster--this gives you the within SS, and the total within SS is just the sum of the cluster-specific WSS (transforming them to variance is just a matter of dividing by the corresponding degrees of freedom); the between SS is obtained by substracting the total WSS from the total SS, the latter being obtained by considering k=1 for example.

By the way, with k=1, WSS=TSS and BSS=0.

If you're after determining the number of clusters or where to stop with the k-means, you might consider the Gap statistic as an alternative to the elbow criteria:

Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the numbers of clusters in a data set via the gap statistic. J. R. Statist. Soc. B, 63(2): 411-423.