Intuition behind the Calinski-Harabasz Index

clustering, k-means

Given $CH(k) = [B(k) / W(k) ] \times [(n-k)/(k-1)]$, where
$n$ = number of data points
$k$ = number of clusters
$W(k)$ = within-cluster variation
$B(k)$ = between-cluster variation.

It is my understanding that the CH index can indicate the optimal number of clusters when doing k-means or hierarchical clustering: you choose the number of clusters $k$ that maximizes $CH(k)$. As $k$ increases, $B(k)$ increases and $W(k)$ decreases.

However, can someone explain to me the intuition behind the second part of the formula, namely $[(n-k) / (k-1)]$? Isn't it too punitive for cases where $n$ is very large, since increasing $k$ by 1 will drastically decrease the whole term?

Best Answer

Some simple intuition: $[B(k)/(k-1)]/[W(k)/(n-k)]$ is analogous to an F-ratio in ANOVA; $B(k)$ and $W(k)$ are between- and within-cluster sums of squares for the $k$ clusters.
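Written out, the F-ratio form is only a regrouping of the two factors in the question's formula, using the same symbols:

$$
CH(k) = \frac{B(k)}{W(k)} \times \frac{n-k}{k-1} = \frac{B(k)/(k-1)}{W(k)/(n-k)}
$$

The numerator is the between-cluster sum of squares per degree of freedom, and the denominator is the within-cluster sum of squares per degree of freedom.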

$B(k)$ has $k-1$ degrees of freedom, while $W(k)$ has $n-k$ degrees of freedom.
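As a rough check on how the ratio of these degrees of freedom behaves when $n$ is large, the sketch below tabulates the multiplicative change in $(n-k)/(k-1)$ when $k$ goes to $k+1$. The helper names `df_factor` and `step_ratio` are made up for this illustration. The change works out to $(n-k-1)(k-1)/[k(n-k)]$, which is close to $(k-1)/k$ whenever $n$ is much larger than $k$, so the relative drop depends on $k$ and hardly at all on $n$.

```python
def df_factor(n, k):
    """Ratio of the degrees of freedom, (n - k) / (k - 1), for 2 <= k < n."""
    if k < 2 or k >= n:
        raise ValueError("need 2 <= k < n")
    return (n - k) / (k - 1)


def step_ratio(n, k):
    """Multiplicative change in the factor when k increases to k + 1."""
    return df_factor(n, k + 1) / df_factor(n, k)


# For each n, print the step ratio for k = 2, 3, 4, 5.
# The values approach (k - 1) / k as n grows.
for n in (100, 10_000, 1_000_000):
    print(n, [round(step_ratio(n, k), 4) for k in (2, 3, 4, 5)])
```

A larger $n$ scales the factor up for every $k$ alike, so it does not by itself change which $k$ is favoured.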

As $k$ grows, if the points all actually came from the same population (so that the clusters reflect no real structure), $B$ should be roughly proportional to $k-1$ and $W$ roughly proportional to $n-k$.
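This proportionality can be checked with a small simulation, sketched here under simplifying assumptions: the data are one-dimensional standard normal draws, and the labels are assigned at random rather than by a clustering algorithm. The function names, sample size, and number of replicates are arbitrary choices for the illustration. With random labels the scaled ratio should average close to 1 for every $k$. A real clustering algorithm picks labels to shrink $W$, so its values on structureless data would be larger.

```python
import random


def f_ratio(values, labels, k):
    """Return [B/(k-1)] / [W/(n-k)] for 1-D values and labels in 0..k-1."""
    n = len(values)
    grand_mean = sum(values) / n
    groups = [[] for _ in range(k)]
    for v, g in zip(values, labels):
        groups[g].append(v)
    between = 0.0
    within = 0.0
    for members in groups:
        m = sum(members) / len(members)
        between += len(members) * (m - grand_mean) ** 2
        within += sum((v - m) ** 2 for v in members)
    return (between / (k - 1)) / (within / (n - k))


def mean_null_ratio(n=300, k=3, reps=400, seed=0):
    """Average the scaled ratio over datasets with no real cluster structure."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        values = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Balanced labels in shuffled order, so no group is ever empty.
        labels = [i % k for i in range(n)]
        rng.shuffle(labels)
        total += f_ratio(values, labels, k)
    return total / reps


for k in (2, 3, 5, 10):
    print(k, round(mean_null_ratio(k=k), 2))
```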

So if we scale for those degrees of freedom, it puts them more on the same scale (apart, of course, from the effectiveness of the clustering, which is what the index attempts to measure).
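To make the index concrete, here is a minimal pure-Python sketch that computes $CH(k)$ from points and cluster labels, taking $B(k)$ and $W(k)$ to be sums of squared Euclidean distances as described above. The function name and the toy data are illustrative only, and the sketch does not guard against $W(k) = 0$.

```python
def calinski_harabasz(points, labels):
    """CH index for points (sequences of floats) with one label per point."""
    n = len(points)
    clusters = {}
    for p, g in zip(points, labels):
        clusters.setdefault(g, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= k < n")

    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0  # B(k): size-weighted squared distances of centroids to the overall mean
    within = 0.0   # W(k): squared distances of points to their own centroid
    for rows in clusters.values():
        c = centroid(rows)
        between += len(rows) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in rows)
    return (between / (k - 1)) / (within / (n - k))


points = [(1.0,), (2.0,), (3.0,), (11.0,), (12.0,), (13.0,)]
print(calinski_harabasz(points, [0, 0, 0, 1, 1, 1]))  # 150.0
print(calinski_harabasz(points, [0, 0, 1, 1, 2, 2]))  # 5.5
```

On this toy data the two well-separated groups give $B = 150$ and $W = 4$, hence $CH(2) = (150/1)/(4/4) = 150$. The three-cluster labeling splits one natural group and merges points across the gap, and the index drops to 5.5.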