Clustering – How to Decide on the Correct Number of Clusters

clustering, k-means

In k-means clustering we find the cluster centers and assign points to k different cluster bins. It is a very well-known algorithm, found in almost every machine learning package on the net. But the missing and, in my opinion, most important part is the choice of a correct k. What is the best value for it? And what is meant by "best"?

I use MATLAB for scientific computing, where looking at silhouette plots is suggested as a way to decide on k. However, I would be more interested in Bayesian approaches. Any suggestions are appreciated.

Best Answer

This has been asked a couple of times on Stack Overflow. You can take a look at what the crowd over there thinks about this question (or small variants of it).

Let me also copy my own answer to this question, on stackoverflow.com:

Unfortunately, there is no way to automatically set the "right" K, nor is there a definition of what "right" means. There is no principled statistical method, simple or complex, that can set the "right" K. There are only heuristics and rules of thumb that sometimes work and sometimes don't.

The situation is more general: many clustering methods have this type of parameter, and I think this is a big open problem in the clustering/unsupervised learning research community.
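
To make the "heuristics" point concrete, here is a minimal sketch of the silhouette criterion mentioned in the question: run k-means for several values of k and keep the k with the highest mean silhouette score. Everything in it is an assumption made for illustration, not MATLAB's implementation: the helper names `kmeans` and `mean_silhouette`, the deterministic farthest-point initialization, and the synthetic data with three well-separated blobs. It needs only the Python standard library (3.8+ for `math.dist`).

```python
import math
import random


def kmeans(points, k, iters=100):
    """Lloyd's k-means with a deterministic farthest-point start (illustrative)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Mean of s = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


# Synthetic data (an assumption): three tight, well-separated 2-D blobs.
rng = random.Random(0)
true_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [(cx + rng.gauss(0, 0.6), cy + rng.gauss(0, 0.6))
          for cx, cy in true_centers for _ in range(30)]

scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest mean silhouette:", best_k)
```

On clean, well-separated data like this the score should peak at the true number of blobs. On real data the curve is often flat or has several similar peaks, which is exactly the "sometimes work, sometimes don't" caveat above.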