Clustering – What to Do When a Centroid Doesn’t Attract Any Points in K-means?

clustering, k-means

I am solving a clustering problem on a trivial dataset with the k-means algorithm. I am running a couple of tests and, from time to time, one centroid doesn't attract any points, i.e., I end up with an empty cluster (see the purple "x" in the picture).

What should I do? Shall I delete it or just stop updating its value? Why?

I am aware that built-in functions (e.g., kmeans() in R) have automatic ways of dealing with this situation, but I am writing the standard algorithm from scratch. Once I've fixed it, I'll be able to compare my results against those of the built-in functions. For the moment, I'm looking for theoretical reasons to prefer one solution over the other.

In the picture, each colour represents a cluster in the current iteration and each "X" is its centroid (old centroids have been kept and are marked with the number of the iteration in which they were computed).

Some related links about the empty-cluster case, added later by @ttnphns:

- https://stackoverflow.com/q/41634047/868905
- What's the implementation of SciKit-Learn K-Means for empty clusters?
- Result of K-Means Algorithm Not Desired
- How to implement k-means cluster analysis algorithm correctly?
- https://stackoverflow.com/q/24919346/868905
- https://stackoverflow.com/q/18009664/868905
- https://stackoverflow.com/q/29243800/868905
- https://stackoverflow.com/q/11075272/868905

Best Answer

This can naturally happen in Lloyd's algorithm; don't try to prevent it. Instead, implement one of the workarounds (e.g. choosing the most distant point as the new centroid of the empty cluster, or simply allowing empty clusters).
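
For concreteness, here is a minimal NumPy sketch (not the OP's code) of one Lloyd iteration implementing the first workaround: an empty cluster is re-seeded at the point farthest from its currently assigned centroid. Names such as `lloyd_step`, `X`, and `centroids` are illustrative assumptions.

```python
import numpy as np

def lloyd_step(X, centroids):
    """One Lloyd iteration; X is (n, d), centroids is (k, d)."""
    # Distance of every point to every centroid, shape (n, k).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)  # nearest centroid per point

    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = X[labels == j]
        if len(members) > 0:
            new_centroids[j] = members.mean(axis=0)
        else:
            # Empty cluster: re-seed it at the point farthest from its
            # own assigned centroid. (The other option mentioned above is
            # to keep the old centroid and accept the empty cluster.)
            farthest = dists[np.arange(len(X)), labels].argmax()
            new_centroids[j] = X[farthest]
    return new_centroids, labels
```

If several clusters come up empty in the same iteration, you would pick distinct far-away points rather than re-seeding them all at the same one.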

You may want to put some safeguards in place, though: for example, when k is chosen larger than the number of data points, there simply is no solution without empty clusters.

Note that this mostly happens when you have really bad starting centroids, e.g. when you initialized them as random coordinates in your data space. Choosing existing data points as initial centroids is much more robust (though empty clusters can still occur).
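
A sketch of that initialization (often called Forgy initialization): draw k distinct data points as the starting centroids rather than sampling k random coordinates from the bounding box of the data. The name `init_from_data` is illustrative.

```python
import numpy as np

def init_from_data(X, k, seed=None):
    """Pick k distinct rows of X as the starting centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)  # k distinct points
    return X[idx].astype(float)
```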

MacQueen's k-means should be safe from this effect, too, since each centroid is seeded with a data point and updated incrementally after every single assignment.
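
A sketch of why the incremental variant avoids the problem, under the usual single-pass description of MacQueen's method: every centroid starts as a data point (so no cluster is born empty), points are assigned once and never taken away, and the winning centroid just tracks a running mean.

```python
import numpy as np

def macqueen_kmeans(X, k):
    """Single-pass MacQueen-style k-means sketch; X is (n, d) with n >= k."""
    centroids = X[:k].astype(float)  # each cluster is seeded with one point
    counts = np.ones(k)              # the seed counts as the first member
    for x in X[k:]:
        j = np.linalg.norm(centroids - x, axis=1).argmin()
        counts[j] += 1
        # Running-mean update of the winning centroid only.
        centroids[j] += (x - centroids[j]) / counts[j]
    return centroids
```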