Calinski-Harabasz Criterion – Determining Acceptable Values for Clustering

Tags: clustering, panel-data, r

I have done a data analysis trying to cluster longitudinal data using R and the kml package. My data consist of around 400 individual trajectories (as they are called in the paper). You can see my results in the following picture:

[Figure: kml clustering results]

After reading chapter 2.2, "Choosing an optimal number of clusters", in the corresponding paper, I still didn't have an answer. I would prefer having 3 clusters, but would the result still be OK with a CH of 80? Actually, I don't even know what the CH value represents.

So my question is: what is an acceptable value of the Calinski & Harabasz (CH) criterion?

Best Answer

There are a few things one should be aware of.

  • Like most internal clustering criteria, Calinski-Harabasz is a heuristic device. The proper way to use it is to compare clustering solutions obtained on the same data: solutions which differ either in the number of clusters or in the clustering method used.

  • There is no "acceptable" cut-off value. You simply compare CH values by eye. The higher the value, the "better" the solution. If the line plot of CH values shows a peak, or at least an abrupt elbow, at one of the solutions, choose that one. If, on the contrary, the line is smooth (horizontal, ascending, or descending), then there is no reason to prefer one solution over the others.

  • The CH criterion is based on ANOVA ideology (the formula is given after this list). Hence, it implies that the clustered objects lie in a Euclidean space of scale (not ordinal, binary, or nominal) variables. If the data clustered were not objects × variables but a matrix of dissimilarities between objects, then the dissimilarity measure should be (squared) Euclidean distance (or, at worst, another metric distance approaching Euclidean distance in its properties).

  • The CH criterion is most suitable when clusters are more or less spherical and compact in their middle (normally distributed, for instance)$^1$. Other conditions being equal, CH tends to prefer cluster solutions consisting of clusters with roughly the same number of objects.
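
For reference, and to answer what the CH value actually represents: the criterion is the between-cluster dispersion divided by the within-cluster dispersion, each scaled by its degrees of freedom,

$$\mathrm{CH}(k) = \frac{B(k)/(k-1)}{W(k)/(n-k)},$$

where $n$ is the number of objects, $k$ the number of clusters, $B(k)$ the between-cluster sum of squares, and $W(k)$ the within-cluster sum of squares. Large values correspond to clusters that are internally tight and mutually well separated; the value has no absolute meaning, which is why it can only be compared across solutions.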

Let's look at an example. Below is a scatterplot of data that were generated as 5 normally distributed clusters lying quite close to each other.

[Figure: scatterplot of the simulated data]

These data were clustered by the hierarchical average-linkage method, and all cluster solutions (cluster memberships) from the 15-cluster through the 2-cluster solution were saved. Then two clustering criteria were applied to compare the solutions and to select the "better" one, if there is any. (A minimal R sketch of this procedure is given below.)
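
Here is a minimal R sketch of that procedure, assuming simulated data (the cluster centers and spreads below are made up for illustration, not the ones behind the figures). CH is computed by hand from its definition; the fpc package's calinhara() should give the same values.

```r
set.seed(1)

# Simulate 5 normally distributed clusters lying fairly close to each other
centers <- matrix(c(0, 0,  3, 0,  0, 3,  3, 3,  1.5, 1.5),
                  ncol = 2, byrow = TRUE)
X <- do.call(rbind, lapply(1:5, function(i)
  cbind(rnorm(80, centers[i, 1], 0.7),
        rnorm(80, centers[i, 2], 0.7))))

# Hierarchical clustering with average linkage
hc <- hclust(dist(X), method = "average")

# Calinski-Harabasz: (B/(k-1)) / (W/(n-k)), with B and W the between- and
# within-cluster sums of squares
ch <- function(X, labels) {
  n <- nrow(X); k <- length(unique(labels))
  W <- sum(sapply(split(as.data.frame(X), labels),
                  function(g) sum(scale(g, scale = FALSE)^2)))
  TSS <- sum(scale(X, scale = FALSE)^2)   # total sum of squares
  B <- TSS - W
  (B / (k - 1)) / (W / (n - k))
}

# Save memberships for the 15- through 2-cluster solutions and compare
ks <- 15:2
ch_values <- sapply(ks, function(k) ch(X, cutree(hc, k)))
plot(ks, ch_values, type = "b", xlab = "Number of clusters",
     ylab = "Calinski-Harabasz")
# Look for a peak or an abrupt elbow; with data like these the
# 5-cluster solution should typically stand out
```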

[Figure: line plots of the Calinski-Harabasz criterion (left) and the C-Index (right) across the 15- through 2-cluster solutions]

The plot for Calinski-Harabasz is on the left. We see that, in this example, CH plainly indicates the 5-cluster solution (labelled CLU5_1) as the best one. The plot for another clustering criterion, C-Index (which is not based on ANOVA ideology and is more universal in its application than CH), is on the right. For C-Index, a lower value indicates a "better" solution. As the plot shows, the 15-cluster solution is formally the best. But remember that with clustering criteria the rugged topography of the plot matters more for the decision than the magnitude itself. Note the elbow at the 5-cluster solution: the 5-cluster solution is still relatively good, while the 4- and 3-cluster solutions deteriorate by leaps. Since we usually wish to get "a better solution with fewer clusters", the choice of the 5-cluster solution appears to be reasonable under C-Index testing, too. (A sketch of the C-Index computation follows below.)
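
For completeness, here is how the C-Index is usually defined (Hubert & Levin): take the sum $S$ of all within-cluster pairwise dissimilarities and compare it to the smallest ($S_{\min}$) and largest ($S_{\max}$) sums attainable over the same number of pairs; the index is $(S - S_{\min})/(S_{\max} - S_{\min})$. A small R sketch, continuing the code above (again an illustration, not the code behind the figures):

```r
c_index <- function(D, labels) {
  D <- as.matrix(D)
  ut <- upper.tri(D)                    # count each pair once
  same <- outer(labels, labels, "==")   # TRUE for within-cluster pairs
  d_within <- D[ut & same]
  nw <- length(d_within)
  d_all <- sort(D[ut])
  S    <- sum(d_within)
  Smin <- sum(head(d_all, nw))          # nw smallest distances overall
  Smax <- sum(tail(d_all, nw))          # nw largest distances overall
  (S - Smin) / (Smax - Smin)            # in [0, 1]; lower is "better"
}

D <- dist(X)
ci_values <- sapply(ks, function(k) c_index(D, cutree(hc, k)))
plot(ks, ci_values, type = "b", xlab = "Number of clusters", ylab = "C-Index")
```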

P.S. This post also raises the question of whether we should trust the actual maximum (or minimum) of a clustering criterion more, or rather the overall landscape of the plot of its values.


$^1$ Later note. Not quite so as written. My probes on simulated datasets convince me that CH has no preference for a bell-shaped distribution over a platykurtic one (such as a uniform distribution in a ball), or for circular clusters over ellipsoidal ones, provided the intracluster overall variances and the intercluster centroid separations are kept the same. One nuance worth keeping in mind, however, is that if clusters are required (as usual) to be nonoverlapping in space, then a good cluster configuration with round clusters is simply easier to encounter in real practice than a similarly good configuration with oblong clusters (the "pencils in a case" effect); that has nothing to do with a clustering criterion's biases.

See also: an overview of internal clustering criteria and how to use them.
