The Ward clustering algorithm is a hierarchical clustering method that minimizes an 'inertia' criterion at each step. This inertia quantifies the sum of squared residuals between the reduced signal and the initial signal: it is a measure of the variance of the error in an l2 (Euclidean) sense. Actually, you even mention it in your question. This is why, I believe, it makes no sense to apply it to a distance matrix that is not an l2 Euclidean distance.
On the other hand, average-linkage or single-linkage hierarchical clustering is perfectly suitable for other distances.
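To make the contrast concrete, here is a minimal sketch with SciPy (the two-blob data is invented for illustration): Ward operates on the raw coordinates themselves, while average linkage happily consumes any precomputed dissimilarity.

```python
# Minimal sketch (SciPy): Ward needs raw Euclidean coordinates, while
# average linkage accepts an arbitrary precomputed distance matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),    # blob around (0, 0)
               rng.normal(5, 1, (20, 2))])   # blob around (5, 5)

# Ward: operates on the observations (implicit Euclidean geometry).
Z_ward = linkage(X, method='ward')

# Average linkage: any dissimilarity works, e.g. cosine distance.
D = pdist(X, metric='cosine')                # condensed distance matrix
Z_avg = linkage(D, method='average')

print(fcluster(Z_ward, t=2, criterion='maxclust'))
print(fcluster(Z_avg, t=2, criterion='maxclust'))
```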
See, even hierarchical clustering needs parameters if you want to get a partitioning out. In fact, hierarchical clustering has (roughly) four parameters:

1. the actual algorithm (divisive vs. agglomerative),
2. the distance function,
3. the linkage criterion (single-link, Ward, etc.), and
4. the distance threshold at which you cut the tree (or any other extraction method).
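Those four knobs map almost one-to-one onto, for example, SciPy's agglomerative API (a sketch; the data and threshold are made up):

```python
# Sketch: the four hierarchical-clustering parameters in SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))

# 1. algorithm: scipy.cluster.hierarchy is agglomerative (bottom-up)
# 2. distance function: metric='cityblock' (Manhattan)
# 3. linkage criterion: method='single'
Z = linkage(X, method='single', metric='cityblock')

# 4. extraction: cut the dendrogram at a distance threshold
labels = fcluster(Z, t=0.5, criterion='distance')
```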
The fact is that there is no good "push button" solution to cluster analysis. It is an exploratory technique, meaning that you have to try different methods and parameters and analyze the results.
I found DBSCAN to be very usable in most cases. Yes, it has two parameters (the distance threshold, a.k.a. the neighbor predicate, and minPts, a.k.a. the core predicate) - I'm not counting the distance function separately this time, because what is really needed is a binary "is neighbor of" predicate; see GDBSCAN.
The reason is that in many applications you can choose these values intuitively if you have understood your data well enough. E.g., when working with geo data, distance is literally in kilometers, which allows me to intuitively specify the spatial resolution.
Similarly, minPts gives me intuitive control over how "significant" a subset of observations needs to be before it becomes a cluster.
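For instance, with scikit-learn (a sketch; the coordinates and thresholds are invented), a 1 km spatial resolution translates directly into eps:

```python
# Sketch: DBSCAN on geo coordinates, where eps has a direct physical meaning.
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0

# Some made-up (latitude, longitude) points, in degrees.
coords_deg = np.array([
    [52.5200, 13.4050],   # Berlin
    [52.5205, 13.4060],
    [52.5190, 13.4040],
    [48.8566, 2.3522],    # Paris: far away, ends up as noise here
])

# Haversine distances in scikit-learn expect radians.
coords_rad = np.radians(coords_deg)

# eps = 1 km spatial resolution; minPts = 3 points to form a cluster.
db = DBSCAN(eps=1.0 / EARTH_RADIUS_KM, min_samples=3,
            metric='haversine', algorithm='ball_tree')
labels = db.fit_predict(coords_rad)
print(labels)   # -1 marks noise
```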
Usually, when you find DBSCAN hard to use, it is because you have not yet understood what "distance" means for your data. You then first need to figure out how to measure distance and what the resulting numbers mean to you. Then you'll know the threshold to use.
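One common aid at that point (it goes back to the original DBSCAN paper) is the k-distance plot: sort each point's distance to its k-th nearest neighbor and look for a "knee". A sketch:

```python
# Sketch: the k-distance plot heuristic for picking the DBSCAN eps threshold.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))       # stand-in for your real data

k = 4                               # a common choice is k = minPts
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each query point
dists, _ = nn.kneighbors(X)                      # is its own 0-th neighbor
k_dist = np.sort(dists[:, -1])
# Plot k_dist (e.g. with matplotlib) and set eps near the knee of the curve.
```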
And in the end, go and try stuff out. It's data exploration, not `return(truth);`. There is no "true" clustering. There are only "obvious", "useless" and "interesting" clusterings, and these qualities cannot be measured mathematically; they are subjective to the user.
For internal cluster validation, the Calinski-Harabasz variance ratio criterion (VRC) is fairly standard.
But there are many, many more, such as the C index, DBCV, etc. I believe some of these indexes even have a dozen variants.
The Dunn index is essentially the ratio separation/compactness, while Davies-Bouldin is a ratio compactness/separation. So I guess you are suggesting just one of the many variants of these two.
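Both flavors are readily available in scikit-learn (a sketch on made-up two-blob data):

```python
# Sketch: two standard internal validation indexes in scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(calinski_harabasz_score(X, labels))  # higher = better (separation/compactness)
print(davies_bouldin_score(X, labels))     # lower = better (compactness/separation)
```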
Note that if you have many clusters, it is better to only consider the nearby neighbor clusters, and not the average distance to all others! Suppose you have one very badly split cluster that is nevertheless extremely well separated from the majority of the data: the naive within/in-between quotient will fail, because averaging over all the distant clusters drowns out the bad split. That is why you usually define separation based on the nearest other cluster(s) only, instead of the entire data set.
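A tiny numeric sketch of that failure mode (the centroids are invented; one pair is deliberately placed almost on top of each other):

```python
# Sketch: nearest-cluster separation vs. average separation over all clusters.
import numpy as np

centroids = np.array([[0.0, 0.0],
                      [0.4, 0.0],     # badly split: nearly duplicates the first
                      [50.0, 50.0],
                      [-50.0, 50.0]])

# Pairwise centroid distances; mask out the diagonal.
dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
np.fill_diagonal(dist, np.nan)

print(np.nanmin(dist, axis=1))    # nearest-cluster separation exposes the 0.4
print(np.nanmean(dist, axis=1))   # averaging hides it among the huge distances
```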
It just shows once more that you cannot rely on Wikipedia alone (and too many people, and even books, simply copy from Wikipedia...)
But beware that all these are just heuristics. You can find counterexamples for each, I suppose.