Solved – Cluster Validation of Incomplete Clustering Algorithms (esp., Density based – DBSCAN, HDBSCAN)

clustering, dbscan, noise, outliers

Context —

Unlike partitional clustering algorithms such as K-Means, spectral, or hierarchical methods, incomplete clustering techniques such as DBSCAN, HDBSCAN, and many others have a notion of noise (outliers).

Common cluster validation (Internal and External) indices – Silhouette Index, C-Index, Dunn Index, Entropy, Within-cluster scatter … don't explicitly accommodate noise.

Questions —

  1. Do there exist special Cluster Validation indices that accommodate noise?
  2. What would be the appropriate treatment of noise, so that it fits well into all cluster validation indices?

The cluster validation should reflect the loss of cluster-ability due to excessive noise.

Some thoughts —

  1. Treat the noise as a new, (k+1)-th partition.
  2. Remove all noise points from the picture (not recommended).
  3. Re-assign noise points to one of the existing clusters (by a nearest-neighbor technique or using some cluster index).
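To make the three options concrete, here is a minimal sketch in plain Python. It assumes noise is marked with the label -1 (the convention used by scikit-learn's DBSCAN); the function names and the toy data are hypothetical and only for illustration.

```python
import math

NOISE = -1  # assumed noise label (scikit-learn DBSCAN convention)

def noise_as_cluster(labels):
    """Option 1: put all noise points into one extra (k+1)-th cluster."""
    extra = max(labels) + 1
    return [extra if l == NOISE else l for l in labels]

def drop_noise(points, labels):
    """Option 2: remove noise points (and their labels) entirely."""
    kept = [(p, l) for p, l in zip(points, labels) if l != NOISE]
    return [p for p, _ in kept], [l for _, l in kept]

def reassign_noise(points, labels):
    """Option 3: give each noise point the label of its nearest clustered point."""
    clustered = [(p, l) for p, l in zip(points, labels) if l != NOISE]
    out = []
    for p, l in zip(points, labels):
        if l == NOISE:
            # Euclidean nearest neighbour among the non-noise points.
            l = min(clustered, key=lambda q: math.dist(p, q[0]))[1]
        out.append(l)
    return out

# Toy data: two tight clusters plus two noise points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (2, 2), (9, 8)]
labels = [0, 0, 0, 1, 1, 1, NOISE, NOISE]

as_cluster = noise_as_cluster(labels)
kept_points, kept_labels = drop_noise(points, labels)
reassigned = reassign_noise(points, labels)
```

Any of the three resulting labelings can then be passed to a standard validation index; which one is appropriate is exactly the open question here.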

Please feel free to put forward your suggestions and links to related research materials and posts.

Edit: Let's assume a two-class problem.

Best Answer

There are different ways of handling noise. If I recall correctly, this is discussed in the DBCV paper (density-based cluster validation). The ELKI clustering toolkit has an option for how to handle noise during evaluation.

I am not convinced by these measures. I believe that with trivial postprocessing you can "optimize" your clustering for most metrics (e.g. assigning noise points to their nearest cluster will improve the silhouette) without any theoretical support or practical usefulness.
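As a rough illustration of this point, the sketch below computes the mean silhouette for one toy data set under three noise treatments. Everything in it is an assumption made for the example: the data, the label lists, and the small `mean_silhouette` helper, which is a hand-written version of the standard definition and not taken from any library.

```python
import math
from collections import defaultdict

def mean_silhouette(points, labels):
    """Mean silhouette with Euclidean distance; needs at least two clusters."""
    clusters = defaultdict(list)
    for p, l in zip(points, labels):
        clusters[l].append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for k, other in clusters.items() if k != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data: two tight clusters; the last two points were flagged as noise.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (2, 2), (9, 8)]

noise_as_cluster = [0, 0, 0, 1, 1, 1, 2, 2]  # noise as a third cluster
noise_reassigned = [0, 0, 0, 1, 1, 1, 0, 1]  # noise moved to nearest cluster

s_cluster = mean_silhouette(points, noise_as_cluster)
s_reassigned = mean_silhouette(points, noise_reassigned)
s_dropped = mean_silhouette(points[:6], noise_as_cluster[:6])  # noise removed

print(s_cluster, s_reassigned, s_dropped)
```

On this data the score rises when noise is reassigned or dropped, although the clustering itself has not changed at all; only the bookkeeping has.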

In my opinion, clustering needs to be treated as an exploratory technique: it does not matter whether it can improve some useless statistical score. The only thing that matters is whether it allows a human to better understand the data.
