Solved – Can someone explain the C-Index in the context of hierarchical clustering

clustering

This is a followup to this question. I am currently trying to implement the C-Index in order to find a near-optimal number of clusters from a hierarchy of clusters. I do this by calculating the C-Index for every step of the (agglomerative) hierarchical clustering. The problem is that the C-Index is minimal (0 to be exact) for very degenerated clusterings. Consider this:

$c = \frac{S-S_{min}}{S_{max}-S_{min}}$

In this case $S$ is the sum of all distances between pairs of observations in the same cluster over all clusters. Let $n$ be the number of these pairs. $S_{min}$ and $S_{max}$ are the sums of $n$ lowest/highest distances across all pairs of observations. In the first step of the hierarchical clustering, the two closest observations (minimal distance) are merged into a cluster. Let $d$ be the distance between these observations. Now there is one pair of observations in the same cluster, so $n=1$ (all other clusters are singletons). Consequently $S=d$. The problem is that $S_{min}$ also equals $d$, because $d$ is the smallest distance (that is why the observations where merged first). So for this case, the C-Index is always 0. It stays 0 as long as only singleton clusters are merged. This means the optimal clustering according the C-Index would always consist of a bunch of clusters containing two observations, and the rest singletons. Does this mean that the C-Index is not applicable to hierarchical clustering? Am I doing something wrong? I have searched a lot, but could not find any suitable explanation. Can someone refer me to some resource that is freely available on the internet? Or, if not, at least a book I may try to get at my universities library?

Thanks in advance!

Best Answer

This may be one of the cases where there's more art than science to clustering. I would suggest that you let your clustering algorithm run for a short time before letting the C-Index calculations kick in. "Short time" may be after processing a few pairs, just when it starts to exceed 0, or some other heuristic. (After all you don't expect to stop at 1 or 2 clusters, otherwise a different separation algorithm may have been deployed.)

For a book recommendation, I can suggest:

You can scan/search the available contents on google books to see if it might meet your needs. It's worked as a reference for me in the past.

Related Question