Solved – Using the Calinski-Harabasz Index to find parameters of DBSCAN

clustering

I would like to use the Calinski-Harabasz Index to evaluate different runs of the DBSCAN algorithm (with different values of min_points).

According to sklearn's documentation, the index is "generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN".

Since I am only comparing different runs of DBSCAN with each other, i.e., not comparing DBSCAN to, say, K-Means, would it make sense to use the index?

Please note that I have tried metrics specifically designed for density-based clusters, such as Density-Based Clustering Validation (DBCV), but its computational cost was more than I can afford (I am clustering around 200,000 real-valued vectors of dimension 300). The main reason for choosing the Calinski-Harabasz Index is that it is fast to compute.

Best Answer

If the index has a clear preference for convex clusters (and if the implementation actually handles noise points properly, which I wouldn't be too sure about), you can try this, but I would not recommend it.

The reason is simple: you'll be evaluating each result by how well it matches the C-H assumptions, i.e., also by how convex the clusters are. If your data doesn't have convex clusters, this index may prefer a suboptimal solution.
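If you try it anyway, the noise caveat matters in practice: as far as I can tell, sklearn's calinski_harabasz_score treats every distinct label as a cluster, so DBSCAN's noise label -1 would be scored as one more cluster. Below is a minimal sketch that masks noise before scoring. The helper name ch_without_noise, the synthetic data, and the eps value are assumptions for illustration only, not taken from the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score


def ch_without_noise(X, labels):
    """Calinski-Harabasz score on non-noise points only (hypothetical helper).

    Returns NaN when fewer than two clusters remain after dropping noise.
    """
    mask = labels != -1  # DBSCAN marks noise points with the label -1
    n_clusters = len(np.unique(labels[mask]))
    if n_clusters < 2 or n_clusters >= mask.sum():
        return float("nan")
    return calinski_harabasz_score(X[mask], labels[mask])


# Synthetic, well-separated blobs; a stand-in for the real 200k x 300 data.
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.6,
    random_state=0,
)

results = {}
for min_samples in (5, 10, 20):
    labels = DBSCAN(eps=0.5, min_samples=min_samples).fit_predict(X)
    noise_fraction = float(np.mean(labels == -1))
    results[min_samples] = (ch_without_noise(X, labels), noise_fraction)

for min_samples, (score, noise_fraction) in results.items():
    print(f"min_samples={min_samples}: CH={score:.1f}, noise={noise_fraction:.1%}")
```

Dropping noise can reward runs that simply declare more points to be noise, so it is worth looking at the noise fraction next to the score. None of this removes the bias towards convex clusters described above.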

If you have a lot of data, you could try approximating DBCV on a sample.
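One way the sampling idea could look is to cluster once on the full data, then evaluate an expensive index on a few random subsamples and average. The sketch below is only that, a sketch: score_on_samples is a hypothetical helper, and since sklearn does not ship DBCV, silhouette_score is plugged in purely as a stand-in for whatever DBCV implementation you have. Note that subsampling thins out the density, so a density-based index computed this way is an approximation at best.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def score_on_samples(X, labels, score_fn, sample_size=1000, n_repeats=5, seed=0):
    """Mean and std of score_fn over random subsamples of the non-noise points.

    score_fn is any callable taking (X, labels); hypothetical helper.
    """
    rng = np.random.default_rng(seed)
    keep = np.flatnonzero(labels != -1)  # indices of non-noise points
    if keep.size == 0:
        return float("nan"), float("nan")
    scores = []
    for _ in range(n_repeats):
        idx = rng.choice(keep, size=min(sample_size, keep.size), replace=False)
        if len(np.unique(labels[idx])) < 2:
            continue  # this draw has fewer than two clusters; skip it
        scores.append(score_fn(X[idx], labels[idx]))
    if not scores:
        return float("nan"), float("nan")
    return float(np.mean(scores)), float(np.std(scores))


# Synthetic stand-in data; eps and min_samples are illustrative values.
X, _ = make_blobs(
    n_samples=3000,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.6,
    random_state=0,
)
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)

# silhouette_score stands in for a DBCV function with the same (X, labels) signature.
mean_score, std_score = score_on_samples(
    X, labels, silhouette_score, sample_size=300, n_repeats=3
)
print(f"sampled score: {mean_score:.3f} +/- {std_score:.3f}")
```

The spread across repeats gives a rough idea of whether the sample size is large enough to trust the estimate.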

Also, it shouldn't be necessary to "optimize" minPts. It is supposedly quite stable, and you should be able to choose it heuristically based on the data dimensionality and data set size. At 200k points, I'd just try 50.
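If you want to check the stability claim on your own data rather than take it on faith, one option is to run DBSCAN for a few minPts values and measure how much neighbouring runs agree, for example with the adjusted Rand index. This is a minimal sketch on synthetic data; the grid of values and eps are assumptions for illustration, and noise is simply treated as one more label in the comparison.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data; replace with your own vectors.
X, _ = make_blobs(
    n_samples=1500,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.6,
    random_state=0,
)

min_samples_grid = [10, 20, 30, 40, 50]
labelings = [
    DBSCAN(eps=0.8, min_samples=m).fit_predict(X) for m in min_samples_grid
]

# Agreement between each pair of neighbouring runs; values near 1 mean
# the partition barely changed when minPts changed.
agreement = [
    adjusted_rand_score(a, b) for a, b in zip(labelings, labelings[1:])
]

for m1, m2, ari in zip(min_samples_grid, min_samples_grid[1:], agreement):
    print(f"min_samples {m1} vs {m2}: ARI={ari:.3f}")
```

If the agreement stays high across the grid, there is little to gain from tuning minPts and a heuristic choice should do.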