Solved – How to measure cluster quality with distance matrix

clustering

When performing clustering with an algorithm such as K-means, it's possible to construct a plot that shows the intra-cluster variability as a function of the number of clusters, to see if there is an elbow that suggests an optimal number of clusters.

However, when working with a dissimilarity matrix (with all values in the range 0-1), it's not so obvious how to measure the quality of clusters obtained. Suppose I had a dataset with a mixture of numerical, categorical and ordinal variables, so I couldn't calculate the loss function with which K-means works, for example.

In this case, I can run K-medoids or hierarchical clustering, but how can I produce a metric that suggests the number of clusters, analogous to the inter/intra-cluster variability used by algorithms that work only with numeric data?

Best Answer

If you read the section on internal cluster validation in Wikipedia, you will learn about a dozen measures for evaluation that require pairwise distances. For example:

  • Silhouette index
  • Dunn's index
  • Davies-Bouldin index

and many more.
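
As a rough illustration of how two of these measures can be computed from nothing but a dissimilarity matrix, here is a minimal pure-Python sketch. The function names `silhouette_from_distances` and `dunn_index` and the toy 4x4 matrix are made up for this example, and the Dunn variant shown (smallest between-cluster distance over largest cluster diameter) is only one of several definitions in use.

```python
def silhouette_from_distances(dist, labels):
    """Mean silhouette width, using only a pairwise dissimilarity matrix."""
    n = len(labels)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    widths = []
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            widths.append(0.0)  # common convention: singletons score 0
            continue
        # a: mean dissimilarity to the other members of i's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to the members of another cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / n


def dunn_index(dist, labels):
    """Smallest between-cluster distance / largest within-cluster distance."""
    n = len(labels)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    min_between = min(dist[i][j] for i, j in pairs if labels[i] != labels[j])
    max_within = max(
        (dist[i][j] for i, j in pairs if labels[i] == labels[j]),
        default=0.0,
    )
    return min_between / max_within if max_within > 0 else float("inf")


# Toy dissimilarity matrix (values in 0-1), invented for illustration:
# objects 0 and 1 are close to each other, as are objects 2 and 3.
dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
good_labels = [0, 0, 1, 1]  # groups the close pairs together
bad_labels = [0, 1, 0, 1]   # splits the close pairs apart

sil_good = silhouette_from_distances(dist, good_labels)
sil_bad = silhouette_from_distances(dist, bad_labels)
dunn_good = dunn_index(dist, good_labels)
dunn_bad = dunn_index(dist, bad_labels)
print(sil_good, sil_bad, dunn_good, dunn_bad)
```

To pick a number of clusters, one would run K-medoids or cut the hierarchical tree for each candidate k, score each resulting labelling with one of these functions, and look for the k with the highest value.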

On the other hand, none of them has really convinced me yet.
