K-Means Clustering – How to Understand Its Drawbacks

Tags: clustering, data mining, k-means, machine learning

K-means is a widely used method in cluster analysis. In my understanding, this method does NOT require ANY assumptions; that is, give me a data set and a pre-specified number of clusters, k, and I just apply this algorithm, which minimizes the sum of squared errors (SSE), i.e., the within-cluster squared error.

So k-means is essentially an optimization problem.
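
For concreteness, the objective being minimized can be written as follows (writing $\mu_j$ for the centroid of cluster $C_j$; the notation is mine, not from the original question):

$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$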

I read some material about the drawbacks of k-means. Most of them say that:

  • k-means assumes the variance of the distribution of each attribute (variable) is spherical;
  • all variables have the same variance;
  • the prior probability for all k clusters is the same, i.e., each cluster has roughly equal number of observations;

If any one of these three assumptions is violated, then k-means will fail.

I could not understand the logic behind this statement. I think the k-means method makes essentially no assumptions; it just minimizes the SSE. So I cannot see the link between minimizing the SSE and those three "assumptions".

Best Answer

While I like David Robinson's answer here a lot, here's some additional critique of k-means.

Clustering non-clustered data

Run k-means on uniform data, and you will still get clusters! It doesn't tell you when the data simply does not cluster, and it can take your research into a dead end this way.

K-means on uniform data
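
As a minimal sketch of this effect (assuming scikit-learn and NumPy; the answer's own figures were made with ELKI, not with this code):

```python
import numpy as np
from sklearn.cluster import KMeans

# Uniform data with no cluster structure at all.
rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))

# k-means happily reports k "clusters" anyway.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)      # three well-separated centroids
print(np.bincount(km.labels_))  # roughly equal-sized partitions of noise
```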

Sensitive to scale

Rescaling your data set will completely change the results. While this in itself is not bad, not realizing that you have to pay extra attention to scaling your data is bad. Scaling factors are $d$ extra hidden parameters in k-means that "default" to 1 and are thus easily overlooked, yet they have a major impact (though of course this applies to many other algorithms, too).

This is probably what you referred to as "all variables have the same variance". Except that ideally, you would also consider non-linear scaling when appropriate.

Also be aware that scaling every axis to unit variance is only a heuristic. It doesn't ensure that k-means works. The appropriate scaling depends on the meaning of your data set. And if you have more than one cluster, you would want every cluster (independently) to have the same variance in every variable, too.
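
A sketch of the scaling effect on a hypothetical two-feature toy data set (not the data from any of the figures): one feature separates two groups, the other is pure noise on a much larger scale.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two groups that differ only in the first feature; the second feature is
# pure noise, but on a 100x larger scale.
a = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 100, 200)])
b = np.column_stack([rng.normal(5, 1, 200), rng.normal(0, 100, 200)])
X = np.vstack([a, b])

raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    (X - X.mean(axis=0)) / X.std(axis=0))

# On the raw data the huge second feature dominates the SSE, so the split is
# essentially random; after scaling, the split follows the first feature.
print("group purity, raw:   ", (raw[:200] == raw[0]).mean())
print("group purity, scaled:", (scaled[:200] == scaled[0]).mean())
```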

Here is a classic counterexample of data sets that k-means cannot cluster. Both axes are i.i.d. in each cluster, so it would be sufficient to do this in 1 dimension. But the clusters have varying variances, and k-means thus splits them incorrectly.

K-means cannot cluster this data set

I don't think this counterexample for k-means is covered by your points:

  • All clusters are spherical (i.i.d. Gaussian).
  • All axes have the same distribution and thus variance.
  • Both clusters have 500 elements each.

Yet k-means still fails badly (and it gets worse if I increase the variance beyond 0.5 for the larger cluster). But it is not the algorithm that failed; it's the assumptions, which don't hold. K-means is working perfectly, it's just optimizing the wrong criterion.
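
A sketch of this counterexample in one dimension (which, as noted above, is sufficient); the centers and scales here are my guesses, not the original generator's values:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two Gaussian clusters of 500 points each, with very different variances.
tight = rng.normal(loc=0.0, scale=0.1, size=500)
wide = rng.normal(loc=1.0, scale=0.5, size=500)
X = np.concatenate([tight, wide]).reshape(-1, 1)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# k-means puts the boundary roughly midway between the two centroids, so a
# chunk of the wide cluster's tail is chopped off and assigned to the tight
# cluster, even though the three "assumptions" above all hold.
wide_labels = labels[500:]
print("wide-cluster points misassigned:",
      min(np.sum(wide_labels == 0), np.sum(wide_labels == 1)))
```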

Even on perfect data sets, it can get stuck in a local minimum

Below is the best of 10 runs of k-means on the classic A3 data set. This is a synthetic data set designed for k-means: 50 clusters, each of Gaussian shape, reasonably well separated. Yet only with k-means++ and 100 iterations did I get the expected result... (the figure below shows the best of 10 runs of regular k-means, for illustration).

k-means on A3 data set

You'll quickly find many clusters in this data set where k-means failed to find the correct structure. For example, in the bottom right, a cluster was broken into three parts. But there is no way k-means is going to move one of these centroids to an entirely different part of the data set; it's trapped in a local minimum (and this was already the best of 10 runs!).

And there are many such local minima in this data set. Very often, when two initial centers are sampled from the same cluster, k-means gets stuck in a minimum where this cluster remains split and two other clusters are merged instead. Not always, but very often. So you need a lot of iterations to have a lucky pick. With 100 iterations of k-means, I still counted 6 errors, and with 1000 iterations I got this down to 4 errors. K-means++, by the way it weights the random initial samples, works much better on this data set.
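
To illustrate the role of restarts and seeding, here is a sketch on a synthetic stand-in (a grid of well-separated Gaussian blobs; the actual A3 data set is not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for a "perfect" data set: 20 well-separated Gaussian
# blobs arranged on a grid.
rng = np.random.default_rng(3)
centers = np.array([[i * 10.0, j * 10.0] for i in range(5) for j in range(4)])
X = np.vstack([c + rng.normal(0, 1.0, size=(200, 2)) for c in centers])

def best_sse(init, n_init):
    km = KMeans(n_clusters=20, init=init, n_init=n_init, random_state=0).fit(X)
    return km.inertia_

# A single random initialization often lands in a local minimum where one
# blob is split and two others merged; many restarts and/or k-means++
# seeding are needed to get close to the intended solution.
print("random init, 1 run:   ", best_sse("random", 1))
print("random init, 100 runs:", best_sse("random", 100))
print("k-means++,   10 runs: ", best_sse("k-means++", 10))
```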

Means are continuous

While you can run k-means on binary data (or one-hot encoded categorical data), the resulting centroids will not be binary anymore. So you do get a result out, but you may be unable to interpret it in the end, because it has a different data type than your original data.
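
A quick sketch with hypothetical one-hot data:

```python
import numpy as np
from sklearn.cluster import KMeans

# One-hot encoded categorical data: every row is a 0/1 indicator vector.
rng = np.random.default_rng(4)
X = np.eye(4)[rng.integers(0, 4, size=200)]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# The centroids are fractional (category frequencies within each cluster),
# not valid one-hot vectors, so they have a different data type than the input.
print(km.cluster_centers_)
```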

Hidden assumption: SSE is worth minimizing

This is essentially already present in the above answer, nicely demonstrated with linear regression. There are some use cases where k-means makes perfect sense. When Lloyd had to decode PCM signals, he did know the number of different tones, and least squared error minimizes the chance of decoding errors. And in color quantization of images, you do minimize the color error when reducing the palette, too. But on your data, is the sum of squared deviations a meaningful criterion to minimize?

In the above counterexample, the variance is not worth minimizing, because it depends on the cluster. Instead, a Gaussian Mixture Model should be fit to the data, as in the figure below:

Gaussian Mixture Modeling

(But this is not the ultimate method either: it's just as easy to construct data that does not satisfy the "mixture of k Gaussian distributions" assumption, e.g., by adding a lot of background noise.)
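
A sketch of the GMM alternative on the same kind of unequal-variance data as the counterexample above (again with my toy parameters, not those of the original figure):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
# Two Gaussians of 500 points each with very different variances.
X = np.concatenate([rng.normal(0.0, 0.1, 500),
                    rng.normal(1.0, 0.5, 500)]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X).predict(X)

# The mixture model estimates a separate variance per component, so it can
# keep the wide cluster intact where k-means chops off its tail.
print("k-means labels in the wide cluster:", np.bincount(km[500:]))
print("GMM labels in the wide cluster:    ", np.bincount(gmm[500:]))
```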

Too easy to use badly

All in all, it's too easy to throw k-means at your data and nevertheless get a result out (one that is pretty much random, but you won't notice). I think it would be better to have a method that can fail if you haven't understood your data...

K-means as quantization

If you want a theoretical model of what k-means does, consider it a quantization approach, not a clustering algorithm.

The objective of k-means - minimizing the squared error - is a reasonable choice if you replace every object by its nearest centroid. (It makes a lot less sense if you inspect the groups of original data, IMHO.)

There are very good use cases for this. The original PCM use case of Lloyd comes to mind, or e.g. color quantization (Wikipedia). If you want to reduce an image to k colors, you do want to replace every pixel with the nearest centroid. Minimizing the squared color deviation then does measure L2 optimality in image approximation using $k$ colors only.
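
A sketch of that use case with a random stand-in image (a real image array loaded from disk would take its place; scikit-learn assumed again):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in "image": a random H x W x 3 RGB array; in practice you would
# load a real image here instead.
rng = np.random.default_rng(6)
image = rng.integers(0, 256, size=(64, 64, 3)).astype(float)
pixels = image.reshape(-1, 3)

k = 16  # reduce the palette to 16 colors
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pixels)

# Replace every pixel by its nearest centroid: this is exactly the k-means
# objective, so the SSE is the L2 approximation error of the k-color image.
quantized = km.cluster_centers_[km.labels_].reshape(image.shape)
print("mean squared color error:", km.inertia_ / pixels.shape[0])
```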

This quantization is probably quite similar to the linear regression example. Linear regression finds the best linear model. And k-means finds (sometimes) the best reduction of a multidimensional data set to k values, where "best" means the least squared error.

IMHO, k-means is a good quantization algorithm (see the first image in this post - if you want to approximate the data set with two points, this is a reasonable choice!). If you want to do cluster analysis, as in discover structure, then k-means is IMHO not the best choice. It tends to find clusters when there are no clusters, and it cannot recognize various structures that you do see a lot in data.


Fine print: all images were generated with ELKI. Data were generated using the .xml data generation format, but they are so basic it is not worth sharing them.