Difference Between Cubic Clustering Criterion, Silhouette Score, and Calinski-Harabasz

clustering, dbscan, hierarchical clustering, mixed type data, ward

I am clustering a mixed geological data set containing numeric data (pump pressure, bit speed, mud temperature), nominal data (presence or absence of specific stones), and ordinal data (relative concentration of minerals, from 0 = absent to 4 = very abundant).

My candidate algorithms are Ward, DBSCAN and BIRCH. I am looking for a good validation criterion to determine the quality of the clustering output. I have read about the Cubic Clustering Criterion, but from how it works it seems quite similar to the Silhouette Score, which, as I understand it, measures the within-cluster and between-cluster sums of squares. Both also appear to differ from Calinski-Harabasz.

Any advice on the advantages and disadvantages of these three validation metrics, given that I am clustering mixed-type data composed of numeric, nominal and ordinal variables?

Best Answer

First of all, DBSCAN produces noise, and it is unclear how to compute these indexes correctly when there is noise. If you treat every noise point as its own cluster, those points will give you, e.g., a bad Silhouette, even when this is exactly the desired behavior. If you pretend noise is a cluster, the result will be even worse.
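To make the noise problem concrete, here is a minimal pure-Python sketch of the mean silhouette width, evaluated three ways on the same toy data. Everything in it is an assumption for illustration: the points are made up, the labels are assigned by hand rather than produced by running DBSCAN, and `-1` marks noise only because that is the convention scikit-learn's DBSCAN uses. Singleton clusters are scored 0, which is a common convention rather than the only possible one.

```python
from math import dist
from statistics import mean

def silhouette(points, labels):
    """Mean silhouette width (Euclidean). Assumes at least two clusters;
    points in singleton clusters score 0 by convention."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = mean(dist(points[i], points[j]) for j in own if j != i)
        # b: smallest mean distance to the members of any other cluster
        b = min(mean(dist(points[i], points[j]) for j in members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)

# Toy data: two tight clusters plus two scattered noise points (label -1).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5), (20, 0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, -1, -1]

# Variant 1: pretend all noise forms one cluster (labels used as-is).
score_noise_cluster = silhouette(points, labels)

# Variant 2: give every noise point its own singleton cluster.
singleton_labels = [lab if lab != -1 else f"noise-{i}"
                    for i, lab in enumerate(labels)]
score_noise_singletons = silhouette(points, singleton_labels)

# Variant 3: drop noise and score only the clustered points.
kept = [i for i, lab in enumerate(labels) if lab != -1]
score_noise_dropped = silhouette([points[i] for i in kept],
                                 [labels[i] for i in kept])

print(score_noise_cluster, score_noise_singletons, score_noise_dropped)
```

On this toy data both noise-inclusive variants score clearly lower than the variant that ignores noise, although the two real clusters are identical in all three. Dropping noise is not a neutral fix either: it rewards a parameterization that simply declares every difficult point to be noise.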

Secondly, you first need to solve the problem of similarity measurement. If your similarity doesn't work (e.g. because the features are badly scaled), then the evaluation will prefer a bad result, too. But there is no mathematically "correct" way of scaling!
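A small sketch of why scaling matters, using made-up readings (all values and variable names below are hypothetical). The first part shows that merely changing the unit of pump pressure flips which record counts as the nearest neighbour under Euclidean distance. The second part shows a Gower-style dissimilarity, one common convention for mixed data: numeric and ordinal features contribute their absolute difference divided by the feature's range, nominal features contribute 0 or 1. Treating the ordinal levels 0 to 4 as equally spaced, and weighting all features equally, are assumptions, not "correct" choices.

```python
from math import dist

# Hypothetical readings: (pump pressure in bar, mud temperature in deg C).
a, b, c = (200.0, 60.0), (203.0, 70.0), (215.0, 61.0)

def nearest_to_a(pressure_factor):
    """Return 'b' or 'c', whichever is closer to a after rescaling pressure."""
    def rescale(r):
        return (r[0] * pressure_factor, r[1])
    d_b = dist(rescale(a), rescale(b))
    d_c = dist(rescale(a), rescale(c))
    return "b" if d_b < d_c else "c"

nearest_in_bar = nearest_to_a(1.0)  # pressure left in bar
nearest_in_mpa = nearest_to_a(0.1)  # same data, pressure in MPa (1 bar = 0.1 MPa)

def gower(x, y, kinds, ranges):
    """Gower-style dissimilarity in [0, 1] with equal feature weights."""
    parts = []
    for xv, yv, kind, rng in zip(x, y, kinds, ranges):
        if kind == "nominal":
            parts.append(0.0 if xv == yv else 1.0)
        else:
            # numeric or ordinal: absolute difference scaled by the range
            parts.append(abs(xv - yv) / rng if rng else 0.0)
    return sum(parts) / len(parts)

# Hypothetical mixed records:
# (pressure in bar, mud temperature, stone present?, mineral level 0..4)
kinds = ("numeric", "numeric", "nominal", "ordinal")
ranges = (15.0, 10.0, None, 4.0)  # max - min per feature; unused for nominal
x = (200.0, 60.0, True, 0)
y = (203.0, 70.0, True, 4)
z = (215.0, 61.0, False, 1)

d_xy = gower(x, y, kinds, ranges)
d_xz = gower(x, z, kinds, ranges)
print(nearest_in_bar, nearest_in_mpa, d_xy, d_xz)
```

Range scaling makes the result independent of the measurement unit, but it only replaces one arbitrary choice with another: the ranges depend on the sample, and different feature weights would rank the pairs differently. Any index computed on top of such a dissimilarity inherits those choices.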