Clustering – Alternatives to K-means When Assumptions Do Not Hold

Tags: clustering, k-means

This tutorial (http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html#example-cluster-plot-kmeans-assumptions-py) explains very clearly some limitations of the k-means clustering method that are rooted in its assumptions.

Specifically:
1) k-means assumes that the variance of the distribution of each attribute (variable) is spherical;
2) all variables have the same variance;
3) the prior probability of all k clusters is the same, i.e., each cluster has roughly an equal number of observations.
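For concreteness, here is a minimal sketch (loosely following the linked scikit-learn example) of data sets that break each of these assumptions; the exact parameters and transformation are just placeholders:

```python
# Construct data sets that violate each k-means assumption and run k-means on them.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1500, centers=3, random_state=170)

# 1) Anisotropic (non-spherical) clusters: apply a linear transformation.
X_aniso = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

# 2) Unequal variance: blobs with very different spreads.
X_varied, _ = make_blobs(n_samples=1500, centers=3,
                         cluster_std=[1.0, 2.5, 0.5], random_state=170)

# 3) Unevenly sized clusters: keep very different numbers of points per blob.
X_uneven = np.vstack([X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]])

for name, data in [("anisotropic", X_aniso),
                   ("unequal variance", X_varied),
                   ("uneven sizes", X_uneven)]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
    print(name, np.bincount(labels))  # inspect how points get split across clusters
```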

Now, thanks to the silhouette method, I can handle the case with the wrong number of blobs, but I am in the dark about how to handle the other cases. Could you point me to some methods to solve the other issues and, if possible, explain how to blend them together, since in my datasets these behaviours coexist?
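A minimal sketch of the silhouette-based selection of k mentioned above (assuming scikit-learn's silhouette_score; the data-generation call is only a stand-in):

```python
# Score a range of k values and keep the one with the highest average silhouette.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1500, centers=3, random_state=170)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```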

Best Answer

In the answers to How to understand the drawbacks of K-means we have already discussed the drawbacks of k-means in detail.

Some of them may appear easy to counter on toy examples (e.g., by undoing the scaling of a distorted data set), but real data will be much more complex, and global normalization alone may not be enough, unfortunately.
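As an illustration of the "global normalization" idea, here is a minimal sketch, assuming scikit-learn: standardizing every variable to unit variance rescales the axes globally, but it cannot undo a stretch or rotation that differs from cluster to cluster, which is one reason it may not be enough on real data.

```python
# Standardize every variable to unit variance before running k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=1500, centers=3,
                  cluster_std=[1.0, 2.5, 0.5], random_state=170)

model = make_pipeline(StandardScaler(),
                      KMeans(n_clusters=3, n_init=10, random_state=0))
labels = model.fit_predict(X)  # per-cluster spread differences remain after scaling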

Rather than hot-fixing these problems, you should first verify whether the k-means objective actually solves your problem.

It's not just about partitioning points into clusters. There is an underlying problem that you want to solve, and with clustering it is usually about gaining "insight" into your data. A simple approach here is to study the results carefully and discard any that seem suspicious. This cannot be automated (though you can automate generating diverse results to avoid redundancies; see "alternative clustering").

k-means is better seen as a vector quantization approach than as clustering. It does not attempt to identify the structure that is in the data; instead, it postulates a structure (k centers) and then optimizes the model parameters. During optimization it assumes that all variables are weighted equally, and the cost of each instance is measured as a sum of squares: $$\text{cost}(x_i) = \min_{\text{center }c} \sum_{\text{dim }d} (x_{i,d}-\mu_{c,d})^2$$ It will fail whenever the data does not have the postulated structure, or this cost function is inappropriate.
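As a sanity check, the cost function above can be written directly in NumPy (a hypothetical helper for illustration, not part of any library):

```python
# Each point pays the squared Euclidean distance to its nearest center;
# k-means minimizes the sum of these costs over all points.
import numpy as np

def kmeans_cost(X, centers):
    # squared distances of every point to every center, shape (n_points, n_centers)
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # each point is charged the distance to its closest center
    return sq_dists.min(axis=1).sum()

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
centers = np.array([[0.5, 0.0], [10.0, 10.0]])
print(kmeans_cost(X, centers))  # 0.25 + 0.25 + 0.0 = 0.5
```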

Thus, whenever you consider k-means, the first thing to check is whether minimizing the above equation solves your problem. If you don't want to answer the question "which centers have the smallest cost according to the above equation?", then k-means is the wrong algorithm!

Here is a positive example, where k-means is a good choice:

We have an RGB image with 24-bit depth (8 bits per channel) and we want to compress that image for fast loading on the interwebz. Instead of storing 24 bits per pixel, we want to use just 4 bits that index into a 16-color palette. Run-length encoding will further help compress the data. So we have to approximate every pixel with one palette entry. Larger deviations are worse, so using the square helps, and the color approximation error of a pixel can be computed as $\Delta R^2+\Delta G^2+\Delta B^2$. This matches exactly the cost function of k-means, and thus we can use this algorithm to optimize our color palette.
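A minimal sketch of this palette optimization, assuming scikit-learn and Pillow are available (the file names are hypothetical):

```python
# Quantize an RGB image to a 16-color palette by clustering pixel colors.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("photo.png").convert("RGB"), dtype=float)
pixels = img.reshape(-1, 3)                      # one row per pixel: (R, G, B)

km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)
palette = km.cluster_centers_                    # the 16-color palette
indices = km.labels_                             # 4-bit index per pixel

quantized = palette[indices].reshape(img.shape).astype(np.uint8)
Image.fromarray(quantized).save("photo_16colors.png")
```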