Solved – Why is the k-means algorithm minimizing the within-cluster variance

clustering, k-means, sums-of-squares, variance

I have read that the k-means algorithm tries to minimize the within-cluster sum of squares (or variance). While brainstorming, a question popped up: why do k-means, or other clustering algorithms whose objective is to minimize the within-cluster variance, choose this as the objective function? What is it about within-cluster variance that makes it the thing to focus on while clustering – and why clustering in particular?

Let me put the question another way (this may be a sub-question, or just a rephrasing of the same question): why would you say that minimizing within-cluster variance is the right way to cluster (referring to the algorithms that minimize it)? Are there other objective functions that could be minimized (or maximized) for clustering?

Best Answer

Within-cluster variance is an easy-to-understand measure of compactness (there are others, too).

So basically, the objective is to find the most compact partitioning of the data set into $k$ partitions.
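For the record, this is the standard textbook objective, not something specific to this thread: the quantity being minimized is the within-cluster sum of squares,

$$\sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,$$

where $C_i$ is the $i$-th cluster and $\mu_i$ is its mean. Dividing each inner sum by $|C_i|$ gives the within-cluster variance, hence the two names.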

K-means, in the Lloyd version, actually originated from 1-d PCM (pulse-code modulation) data as far as I know. So assume you have a really bad telephone line, and someone is beeping a series of tones at you: how do you assign the frequencies to a scale of, say, 10 tones? Well, you can ask the other party to just send a bulk of data, store the frequencies in a list, and then try to split it into 10 bins such that these bins are somewhat compact and separated. Even when the frequencies are distorted by the transmission, there is a good chance they will still be separable with this approach.
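As a rough illustration of that 1-d binning idea, here is a minimal Lloyd-style sketch in plain NumPy (the tone frequencies and noise level are made up for the example):

```python
import numpy as np

def lloyd_1d(values, k, iters=100, seed=0):
    """Minimal Lloyd iteration on 1-d data: assign each point to the
    nearest center, then move each center to the mean of its points."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        # Assignment step: index of the nearest center for each value.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Update step: each center moves to the mean of its assigned values.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return centers, labels

# Noisy recordings of 10 underlying tones (frequencies in Hz, invented).
true_tones = np.linspace(300.0, 3000.0, 10)
observed = np.repeat(true_tones, 50) + np.random.default_rng(1).normal(0, 30, 500)

centers, labels = lloyd_1d(observed, k=10)
print(np.sort(centers))  # recovered bin centers, close to the original tones
```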

This is also why k-means usually comes out best when you evaluate clusterings with any other measure of compactness: they are just two different measures of a similar concept.
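To make that concrete, here is one way to check it, assuming scikit-learn is available: cluster some data with k-means and score the result with a different compactness/separation measure, such as the silhouette coefficient (the dataset parameters here are arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known number of compact groups.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)

# K-means directly minimizes the within-cluster sum of squares...
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# ...yet it also scores well on a measure it never optimized, because
# both reward compact, well-separated partitions.
print("silhouette:", silhouette_score(X, labels))
```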