Clustering – Why Minimizing WCSS Maximizes Distance Between Clusters in K-means

clusteringdistanceeuclideank-meanssums-of-squares

From a conceptual and algorithmic standpoint, I understand how K-means works. However, from a mathematical standpoint, I don't understand why minimizing the WCSS (within-cluster sums of squares) will necessarily maximize the distance between clusters. In other words, can somebody show how this function is equal to maximizing the distance between clusters? It would be helpful to see a derivation that shows all of the steps or please point me to the appropriate reference(s).

Update I found this Witten and Tibshirani reference but it isn't obvious how to get from equation 7 to equation 8.

Best Answer

K-means is all about the analysis-of-variance paradigm. ANOVA - both uni- and multivariate - is based on the fact that the sum of squared deviations about the grand centroid is comprised of such scatter about the group centroids and the scatter of those centroids about the grand one: SStotal=SSwithin+SSbetween. So, if SSwithin is minimized then SSbetween is maximized.

SS of deviations of some points about their centroid (arithmetic mean) is known to be directly related to the overall squared euclidean distance between the points: the sum of squared deviations from centroid is equal to the sum of pairwise squared Euclidean distances divided by the number of points. (This is the direct extension of the trigonometric property of centroid. And this relation is exploited also in the double centering of distance matrix.)

Thus, saying "SSbetween for centroids (as points) is maximized" is alias to say "the (weighted) set of squared distances between the centroids is maximized".

Note: in SSbetween each centroid is weighted by the number of points Ni in that cluster i. That is, each centroid is counted Ni times. For example, with two centroids in the data, 1 and 2, SSbetween = N1*D1^2+N2*D2^2 where D1 and D2 are the deviations of the centroids from the grand mean. That where word "weighted" in the former paragraph stems from.

Example

Data (N=6: N1=3, N2=2, N3=1)
        V1            V2       Group
       2.06          7.73          1
        .67          5.27          1
       6.62          9.36          1
       3.16          5.23          2
       7.66          1.27          2
       5.59          9.83          3

SSdeviations
        V1            V2            Overall
SSt   37.82993333 + 51.24408333 = 89.07401666
SSw   29.50106667 + 16.31966667 = 45.82073333
SSb   8.328866667 + 34.92441667 = 43.25328333

SSt is directly related to the squared Euclidean distances between the data points:

Matrix of squared Euclidean distances
     .00000000    7.98370000   23.45050000    7.46000000   73.09160000   16.87090000 
    7.98370000     .00000000   52.13060000    6.20170000   64.86010000   45.00000000 
   23.45050000   52.13060000     .00000000   29.02850000   66.52970000    1.28180000 
    7.46000000    6.20170000   29.02850000     .00000000   35.93160000   27.06490000 
   73.09160000   64.86010000   66.52970000   35.93160000     .00000000   77.55850000 
   16.87090000   45.00000000    1.28180000   27.06490000   77.55850000     .00000000 

Its sum/2, the sum of the distances = 534.4441000
534.4441000 / N = 89.07401666 = SSt

The same reasoning holds for SSb.

Matrix of squared Euclidean distances between the 3 group centroids (see https://stats.stackexchange.com/q/148847/3277)
     .00000000   22.92738889   11.76592222 
   22.92738889     .00000000   43.32880000 
   11.76592222   43.32880000     .00000000 

3 centroids are 3 points, but SSb is based on N points (propagated centroids):
N1 points representing centroid1, N2 points representing centroid2 and N3 representing centroid3.
Therefore the sum of the distances must be weighted accordingly:
N1*N2*22.92738889 + N1*N3*11.76592222 + N2*N3*43.32880000 = 259.51969998
259.51969998 / N = 43.25328333 = SSb

Moral in words: maximizing SSb is equivalent to maximizing
the weighted sum of pairwise squared distances between the centroids.
(And maximizing SSb corresponds to minimizing SSw, since SSt is constant.)