From a conceptual and algorithmic standpoint, I understand how K-means works. However, from a mathematical standpoint, I don't understand why minimizing the WCSS (within-cluster sum of squares) necessarily maximizes the distance between clusters. In other words, can somebody show that minimizing this function is equivalent to maximizing the distance between clusters? It would be helpful to see a derivation that shows all of the steps, or please point me to the appropriate reference(s).
Update: I found this Witten and Tibshirani reference, but it isn't obvious how to get from equation 7 to equation 8.
Best Answer
K-means is all about the analysis-of-variance paradigm. ANOVA - both uni- and multivariate - is based on the fact that the sum of squared deviations about the grand centroid decomposes into the scatter about the group centroids plus the scatter of those centroids about the grand one: SStotal = SSwithin + SSbetween. Since SStotal is fixed for a given dataset, minimizing SSwithin necessarily maximizes SSbetween.
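As a quick numerical check of this decomposition (a sketch using NumPy with made-up data and an arbitrary two-group partition; the variable names are mine, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two arbitrary groups of 2-D points; any partition satisfies the identity
a = rng.normal(0, 1, size=(5, 2))
b = rng.normal(3, 1, size=(7, 2))
points = np.vstack([a, b])
labels = np.array([0] * 5 + [1] * 7)

grand = points.mean(axis=0)
ss_total = ((points - grand) ** 2).sum()

ss_within = 0.0
ss_between = 0.0
for k in (0, 1):
    cluster = points[labels == k]
    centroid = cluster.mean(axis=0)
    # scatter of the points about their own centroid
    ss_within += ((cluster - centroid) ** 2).sum()
    # scatter of the centroid about the grand centroid, weighted by cluster size
    ss_between += len(cluster) * ((centroid - grand) ** 2).sum()

assert np.isclose(ss_total, ss_within + ss_between)
```

Because ss_total does not depend on the partition, driving ss_within down in the loop above is the same as driving ss_between up.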
The SS of deviations of a set of points about their centroid (arithmetic mean) is known to be directly related to the overall squared Euclidean distance between the points: the sum of squared deviations from the centroid equals the sum of pairwise squared Euclidean distances divided by the number of points. (This is a direct extension of the trigonometric property of the centroid, and the same relation is exploited in the double centering of a distance matrix.)
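This identity is easy to verify numerically; a small sketch with arbitrary random points:

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.normal(size=(6, 3))
n = len(pts)

centroid = pts.mean(axis=0)
ss_about_centroid = ((pts - centroid) ** 2).sum()

# Sum of squared Euclidean distances over all unordered pairs of points
pairwise = 0.0
for i in range(n):
    for j in range(i + 1, n):
        pairwise += ((pts[i] - pts[j]) ** 2).sum()

# SS about the centroid = (sum of pairwise squared distances) / n
assert np.isclose(ss_about_centroid, pairwise / n)
```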
Thus, saying "SSbetween for the centroids (taken as points) is maximized" is equivalent to saying "the (weighted) set of squared distances between the centroids is maximized".
Note: in SSbetween each centroid is weighted by the number of points Ni in that cluster i; that is, each centroid is counted Ni times. For example, with two centroids in the data, 1 and 2, SSbetween = N1*D1^2 + N2*D2^2, where D1 and D2 are the deviations of the two centroids from the grand mean. That is where the word "weighted" in the previous paragraph comes from.

Example