Solved – Which metrics are suitable for density-based clustering validation?

clustering, dbscan, model-evaluation, outliers

I'm working on a project where I use several clustering methods, mainly density-based ones such as HDBSCAN and OPTICS. I'm looking for a metric to evaluate clustering results that takes outliers and differently shaped clusters into account. One evaluation metric I found is DBCV, but it hasn't received much attention in the data science community, so I'm not sure about its robustness. Its runtime also makes it unsuitable when we have several thousand points, even in two dimensions.

DBCV source code: https://github.com/christopherjenness/DBCV

Best Answer

I know it's a little bit late, but I'm currently studying density-based clustering algorithms, and I found that the most suitable metric is in fact DBCV:

1) It deals with noise, which is intrinsic to the definition of density-based clustering and is not taken into account by indexes such as Silhouette or Davies-Bouldin.

2) It captures the shape of each cluster by building a minimum spanning tree (MST) over density-based distances rather than plain distances, so it can handle arbitrary-shaped clusters, which is not possible with metrics like the ones mentioned above.
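To make point 2 a bit more concrete, here is a rough pure-Python sketch of the idea as I understand it from the DBCV paper: compute an "all-points" core distance for every point in a cluster, turn plain distances into mutual reachability distances, and build an MST over those. The function names (`all_points_core_distance`, `mutual_reachability`, `mst_edges`) and the toy `chain` data are my own and do not come from any library. The sketch assumes Python 3.8+ (for `math.dist`) and no duplicate points, since a zero distance would cause a division by zero.

```python
import math

def all_points_core_distance(point, others, dim):
    """Core distance of `point` based on all other members of its cluster."""
    # Average the inverse distances raised to the dimensionality, then invert.
    total = sum((1.0 / math.dist(point, o)) ** dim for o in others)
    return (total / len(others)) ** (-1.0 / dim)

def mutual_reachability(cluster):
    """Dense matrix of mutual reachability distances within one cluster."""
    dim = len(cluster[0])
    n = len(cluster)
    core = [
        all_points_core_distance(p, cluster[:i] + cluster[i + 1:], dim)
        for i, p in enumerate(cluster)
    ]
    return [
        [
            0.0 if i == j
            else max(core[i], core[j], math.dist(cluster[i], cluster[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]

def mst_edges(weights):
    """Prim's algorithm on a dense weight matrix; returns (i, j, weight) edges."""
    n = len(weights)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge leaving the current tree.
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: weights[e[0]][e[1]],
        )
        edges.append((i, j, weights[i][j]))
        in_tree.add(j)
    return edges

# Toy elongated cluster: a chain of points, a shape centroid-based indexes handle poorly.
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
mr = mutual_reachability(chain)
edges = mst_edges(mr)

# Largest MST edge: a simplified stand-in for the cluster's density sparseness.
sparseness = max(w for _, _, w in edges)
print(f"MST edges: {len(edges)}, largest edge weight: {sparseness:.3f}")
```

As far as I can tell, the paper computes the sparseness over the internal edges of the MST only. The sketch takes the maximum over all edges for brevity, so treat it as an illustration and not as a reference implementation.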

You can check out the DBCV paper (Moulavi et al., "Density-Based Clustering Validation", 2014) for a better understanding.

I tried the same implementation as you did, but eventually found a better one in the hdbscan package from the scikit-learn-contrib repository (it's faster because many of its functions are coded in C).

I've heard that another commonly used index is CDbw, because it lets you choose how many representatives to use for each cluster. However, every cluster gets the same number of representatives, and that number must be specified by the user, which is undesirable because it is yet another parameter that must be tuned properly. In the DBCV paper you will find a comparison between many metrics, including CDbw.

Cheers!