Solved – Inconsistency in calculating the Calinski-Harabasz index for a given clustering in R

clusteringr

I am interested in determining the optimal number of clusters calculated by the PAM clustering algorithm using the Calinski-Harabasz (CH) index. To that end, I found 2 different R functions calculating CH values for a given clustering, but which returned different results: ?cluster.stats (in the fpc package), and ?index.G1 (in the clusterSim package).

First one is called via:

pam.res <- pam(dist.matrix, 2, diss=TRUE)
ch1     <- cluster.stats(dist.matrix, pam.res$clustering, silhouette=TRUE)$ch

Second one is called via:

ch2 <- index.G1(t(dataframe), pam.res$clustering, d=dist.matrix)

Data may be found here: dataframe.RData, or here: dist.matrix.RData [dead links].

  • Can anybody explain the difference between these two CH index calculations to me?

    Using cluster.stats(), the highest CH index is obtained for 2 clusters ($\approx32$); while using index.G1(), the highest CH index is obtained for 3 clusters ($\approx60$, and the value for 2 clusters is totally different from the previous, $\approx54$).

  • Which function is normally used to calculate the CH index?

Best Answer

There is one method of calculating Caliński & Harabasz (1974) index for the same distance matrix, so if two R functions show different results one of them is wrong. Hence your question is off-topic.

  • Look how Caliński & Harabasz index is calculated, in their original paper [1] or e.g. here.

  • Then check the source code of both R functions, find a bug and report it to package creators.

Here are fpc and clusterSim example sites on GitHub where you can view the source code: https://github.com/cran/fpc/tree/master/R, https://github.com/cran/clusterSim.

[1] Caliński, T., and J. Harabasz. "A dendrite method for cluster analysis." Communications in Statistics. Vol. 3, No. 1, 1974, pp. 1–27.