Solved – which metrics are suitable for density-based clustering validation

clustering, dbscan, model-evaluation, outliers

I'm working on a project that uses several clustering methods, mainly density-based ones such as HDBSCAN and OPTICS. I'm looking for a metric to evaluate clustering results that takes outliers and differently shaped clusters into account. One evaluation metric I found is DBCV, but it hasn't received much attention in the data science community, so I'm not sure how robust it is. Its runtime is also impractical once there are several thousand points, even in two dimensions.

DBCV source code: https://github.com/christopherjenness/DBCV

Best Answer

I know it's a little late, but I'm currently studying density-based clustering algorithms, and I found that the most suitable metric is in fact DBCV:

1) It deals with noise (which is intrinsic to the definition of density-based clustering, and is not taken into account by indexes such as Silhouette or Davies-Bouldin).

2) It captures the shape of each cluster by building an MST based on 'density' rather than raw distances (so it can handle arbitrarily shaped clusters, which is not possible with metrics like the ones mentioned above).
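To make point 2 more concrete, here is a minimal pure-Python sketch of the density-based building blocks as I understand them from the DBCV paper: the all-points core distance, the mutual reachability distance, and an MST over one cluster. The function names are my own, the code assumes all points in a cluster are distinct, and it deliberately stops short of the full index (no density sparseness or separation terms), so treat it as an illustration rather than a reference implementation.

```python
import math


def all_points_core_distance(point, others):
    """All-points core distance of `point` w.r.t. the other members of its cluster.

    Assumes `others` is non-empty and contains no duplicate of `point`.
    """
    d = len(point)  # dimensionality of the data
    total = sum((1.0 / math.dist(point, q)) ** d for q in others)
    return (total / len(others)) ** (-1.0 / d)


def mutual_reachability(cluster):
    """Dense matrix of mutual reachability distances for one cluster."""
    n = len(cluster)
    core = [all_points_core_distance(p, cluster[:i] + cluster[i + 1:])
            for i, p in enumerate(cluster)]
    return [[0.0 if i == j
             else max(core[i], core[j], math.dist(cluster[i], cluster[j]))
             for j in range(n)]
            for i in range(n)]


def minimum_spanning_tree(weights):
    """Prim's algorithm on a dense weight matrix; returns (i, j, weight) edges."""
    n = len(weights)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge into each node
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, weights[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v] = weights[u][v]
                parent[v] = u
    return edges


if __name__ == "__main__":
    cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]  # toy data
    for edge in minimum_spanning_tree(mutual_reachability(cluster)):
        print(edge)
```

Because each edge weight is the maximum of the two core distances and the raw distance, points in sparse regions are pushed apart, which is what lets the tree follow the density of the cluster instead of its geometric centre.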

You can check out the DBCV paper (Moulavi, Jaskowiak, Campello, Zimek and Sander, "Density-Based Clustering Validation", SIAM SDM 2014) for a better understanding.

I tried the same implementation as you, but eventually found a better one in the hdbscan implementation from the scikit-learn-contrib repository (it's faster because many of its functions are coded in C).

I have heard that another commonly used index is CDbw, because it lets you choose how many representatives to use for each cluster. However, every cluster gets the same number of representatives, and that number must be specified by the user, which is undesirable because it is yet another parameter to tune. The DBCV paper includes a comparison between many metrics, including CDbw.

Cheers!

Related Solutions

Solved – How to implement density-based clustering

Given that you have a large and noisy dataset, I strongly recommend a dimensionality reduction step prior to clustering. This should filter out potentially irrelevant variation and let the clustering algorithm work in a lower-dimensional space. Principal Component Analysis (PCA) and locality-sensitive hashing (LSH) are two standard approaches.

Detecting density-based clusters in high-dimensional spaces can be very demanding, even with noiseless data. High-dimensional density estimation is a typical scenario where the curse of dimensionality manifests. DBSCAN ultimately relies on finding fixed-radius nearest neighbours for each point. As the dimensionality of the data increases, this nearest-neighbour (NN) search becomes more and more difficult. In addition (and most importantly), a standard distance metric such as Euclidean distance becomes increasingly uninformative. Therefore, even if we have a distance and a neighbourhood to work with, that information is not very useful. This association between the curse of dimensionality and NN-related tasks has been touched upon many times on CV, e.g. see 1, 2, 3, 4.
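The loss of contrast in Euclidean distances can be seen with a small experiment. The sketch below is my own illustration (uniform random points, a hypothetical helper called `relative_contrast`): it measures how much farther the farthest point is than the nearest one, relative to the nearest, as the dimension grows.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min for distances from one query point
    to n_points uniform random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast shrinks as the dimension grows: nearest and farthest
    # neighbours end up at almost the same distance.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

When the contrast is close to zero, a fixed radius `eps` either captures almost no points or almost all of them, which is why a dimensionality reduction step helps before running DBSCAN.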

By the way, something "simple" like the following script, where $N$ is considerably larger than the 600k in your case, runs on my laptop (Intel i5 U-series) in under 5 minutes using ~10 GB of RAM. This is because the fixed-radius nearest-neighbour problem mentioned above is solved within the dbscan library using $k$-d trees. Most NN-finding routines rely on a spatial index or some approximation; otherwise, even cases with a few tens of thousands of points would become prohibitively expensive because of the $O(n^2)$ requirements. So, while not "instant", a use case with 600k points is definitely doable in R given a standard workstation and some appropriate dimensionality reduction.

library(dbscan)
N <- 10^7; p <- 50
# N x p matrix of heavy-tailed noise (Student's t, 2 degrees of freedom)
Q <- matrix(rt(N * p, df = 2), nrow = N)
W <- dbscan(Q, eps = 0.01, minPts = 100)  # takes about 4.5 minutes

Solved – Unsupervised outlier detection in 2D space

Your task seems to be a clustering task rather than an outlier detection task.

In the following, I use this popular data set of user locations (Joensuu).

Running OPTICS with the parameters

-dbc.in /tmp/MopsiLocations2012-Joensuu.txt
-algorithm clustering.optics.OPTICSXi -opticsxi.xi 0.05
-algorithm.distancefunction geo.LngLatDistanceFunction
-optics.epsilon 5000.0 -optics.minpts 50

yields the following (hierarchical) clustering. You can see there are three larger clusters (corresponding to Joensuu, Lieska, and Savijärvi; note that the plot has latitude and longitude 'the wrong way'), and some noise (violet here) that is not density-reachable with 5km distance and 50 points. These are your outliers.

You can tell there are some subclusters in both cities. For example one corresponding to the Prisma Joensuu shopping mall. To see more detail, it is helpful to further reduce epsilon, maybe to just 500 meters.
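The noise criterion used above ("not density-reachable with 5 km distance and 50 points") can be sketched in a few lines of plain Python. This is not what ELKI's OPTICSXi does internally; it is only a brute-force illustration of the DBSCAN-style definition with a great-circle distance. The function names, the (lat, lon) argument order, and the toy coordinates are my own assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def dbscan_noise(points, eps_m, min_pts):
    """Indices of points that are neither core points nor within eps of one.

    Brute force, O(n^2): fine for a sketch, too slow for large data sets.
    """
    n = len(points)
    neighbours = [[j for j in range(n)
                   if haversine_m(points[i], points[j]) <= eps_m]
                  for i in range(n)]
    # A core point has at least min_pts points (itself included) within eps.
    core = {i for i in range(n) if len(neighbours[i]) >= min_pts}
    return [i for i in range(n)
            if i not in core and not core.intersection(neighbours[i])]


if __name__ == "__main__":
    # Made-up coordinates: ten points about 111 m apart plus one far away.
    pts = [(62.60 + 0.001 * k, 29.76) for k in range(10)] + [(63.30, 30.00)]
    print(dbscan_noise(pts, eps_m=5000.0, min_pts=5))
```

Reducing `eps_m` (for example to 500 metres, as suggested above) makes the criterion stricter, so more isolated points are reported as noise.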
