I am interested in determining the optimal number of clusters calculated by the PAM clustering algorithm using the Calinski-Harabasz (CH) index. To that end, I found 2 different R functions calculating CH values for a given clustering, but which returned different results: ?cluster.stats (in the fpc package), and ?index.G1 (in the clusterSim package).
First one is called via:
pam.res <- pam(dist.matrix, 2, diss=TRUE)
ch1 <- cluster.stats(dist.matrix, pam.res$clustering, silhouette=TRUE)$ch
Second one is called via:
ch2 <- index.G1(t(dataframe), pam.res$clustering, d=dist.matrix)
Data may be found here: dataframe.RData, or here: dist.matrix.RData [dead links].
-
Can anybody explain the difference between these two CH index calculations to me?
Using
cluster.stats()
, the highest CH index is obtained for 2 clusters ($\approx32$); while usingindex.G1()
, the highest CH index is obtained for 3 clusters ($\approx60$, and the value for 2 clusters is totally different from the previous, $\approx54$). -
Which function is normally used to calculate the CH index?
Best Answer
There is one method of calculating Caliński & Harabasz (1974) index for the same distance matrix, so if two R functions show different results one of them is wrong. Hence your question is off-topic.
Look how Caliński & Harabasz index is calculated, in their original paper [1] or e.g. here.
Then check the source code of both R functions, find a bug and report it to package creators.
Here are
fpc
andclusterSim
example sites on GitHub where you can view the source code: https://github.com/cran/fpc/tree/master/R, https://github.com/cran/clusterSim.[1] Caliński, T., and J. Harabasz. "A dendrite method for cluster analysis." Communications in Statistics. Vol. 3, No. 1, 1974, pp. 1–27.