Does k-means have any advantages over HDBSCAN except for runtime?

clustering, dbscan, hierarchical clustering, k-means

I have recently learned about HDBSCAN (a fairly new method for clustering, not yet available in scikit-learn) and am really surprised at how good it is. The following picture illustrates that the predecessor of HDBSCAN – DBSCAN – is already the only algorithm that performs perfectly on a sample of different clustering tasks:

[Image: Clustering]

With HDBSCAN, you do not even need to set the distance parameter of DBSCAN, making it even more intuitive. I have tried it out on a few custom clustering tasks myself, and it always performed better than any other algorithm I have tried so far.

So my question is: except for computation time, where k-means is still superior to all, is there any case where k-means might be superior? High-dimensional data, for example, or a weird combination of clusters? I honestly can't really think of anything…

Best Answer

  1. Randomization can be valuable. You can run k-means several times to get different possible clusters, as not all may be good. With HDBSCAN, you will always get the same result again.

  2. Classifier: k-means yields an obvious and fast nearest-center classifier to predict the label for new objects. Correctly labeling new objects in HDBSCAN isn't obvious.

  3. No noise. Many users don't (want to) know how to handle noise in their data. K-means gives a very simple and easy to understand result: every object belongs to exactly one cluster. With HDBSCAN, objects can belong to 0 clusters, and clusters are actually a tree and not flat.

  4. Performance and approximation. If you have a huge dataset, you can just take a random sample for k-means, and statistics says you'll get almost the same result. For HDBSCAN, it's not clear how to use it only with a subset of the data.
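To make points 1, 2 and 4 concrete, here is a minimal pure-Python sketch of k-means (plain Lloyd iterations, not any particular library's implementation). The helper names `sq_dist`, `predict`, `kmeans` and `best_of`, the toy two-blob data, and the choice to keep the restart with the lowest sum of squared errors are all my own assumptions for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(centers, point):
    """Nearest-center classifier (point 2): index of the closest center."""
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i], point))

def kmeans(points, k, seed, iters=100):
    """Plain Lloyd iterations; returns (centers, sum of squared errors)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers (point 1)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[predict(centers, p)].append(p)
        # New center = mean of its group; an empty group keeps its old center.
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
               for i, g in enumerate(groups)]
        if new == centers:  # assignments are stable, so we have converged
            break
        centers = new
    sse = sum(sq_dist(p, centers[predict(centers, p)]) for p in points)
    return centers, sse

def best_of(points, k, restarts=10):
    """Run several random restarts and keep the one with the lowest SSE."""
    runs = (kmeans(points, k, seed) for seed in range(restarts))
    return min(runs, key=lambda run: run[1])

# Hypothetical toy data: two well-separated Gaussian blobs.
rng = random.Random(0)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(200)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(200)]
data = blob_a + blob_b

centers, sse = best_of(data, k=2)                        # full data
sample_centers, _ = best_of(rng.sample(data, 50), k=2)   # random sample (point 4)

# New objects are labeled by their nearest center, no refitting needed.
label_near_a = predict(centers, (0.2, -0.1))
label_near_b = predict(centers, (9.8, 10.3))
```

On easy data like this, the centers fitted on the 50-point sample should land close to those fitted on all 400 points, and `predict` labels a new object with one distance computation per center. How well this carries over to real data depends on how well the k-means assumptions fit that data.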

But don't get me wrong. IMHO k-means is very limited, hard to use, and often badly used on inappropriate problems and data. I do admire the HDBSCAN algorithm (and the original DBSCAN and OPTICS). On geo data, these just work a thousand times better than k-means. K-means is totally overused (because too many classes do not teach anything except k-means), and mini-batch k-means is the worst version of k-means: it does not make sense to use it when your data fits into memory (hence it should be removed from sklearn IMHO).