How to measure the consistency of clustering results

clustering

I'm clustering data on a daily basis and would like to measure the consistency of the clustering method.

Let's say method A produces the following clusters:

On day 1: {a,b,c} {d,f} {g}
On day 2: {a,b,c} {d,f} {g}
On day 3: {a,b,c} {d,f} {g}

With method B:

On day 1: {a,b,c} {d,f} {g}
On day 2: {a,b} {c,d,f} {g}
On day 3: {a,b} {c,d} {f} {g}

With method C:

On day 1: {a,b,c} {d,f} {g}
On day 2: {a,d} {b} {c,g} {f}
On day 3: {a,g,d} {b} {c} {f}

The number of variables stays the same, but the cluster sizes and the number of clusters vary.

Obviously the grouping is less consistent in the latter examples than in the first one. Ideally I'd like a measure that assigns a value of 1.0 to a completely consistent method and 0.0 where the clustering seems random. I'm struggling to find literature or pointers on how this can be achieved. How could it be done?
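For concreteness, the daily clusterings above can be written as per-item cluster labels over a fixed item order (a, b, c, d, f, g), which is the form most partition-comparison tools expect. The ordering and the label values below are arbitrary choices, shown as a Python sketch:

# Each day's clustering as a vector of cluster ids over the fixed
# item order (a, b, c, d, f, g); the ids themselves are arbitrary.
method_a = [
    [0, 0, 0, 1, 1, 2],  # day 1: {a,b,c} {d,f} {g}
    [0, 0, 0, 1, 1, 2],  # day 2: {a,b,c} {d,f} {g}
    [0, 0, 0, 1, 1, 2],  # day 3: {a,b,c} {d,f} {g}
]
method_b = [
    [0, 0, 0, 1, 1, 2],  # day 1: {a,b,c} {d,f} {g}
    [0, 0, 1, 1, 1, 2],  # day 2: {a,b} {c,d,f} {g}
    [0, 0, 1, 1, 2, 3],  # day 3: {a,b} {c,d} {f} {g}
]
method_c = [
    [0, 0, 0, 1, 1, 2],  # day 1: {a,b,c} {d,f} {g}
    [0, 1, 2, 0, 3, 2],  # day 2: {a,d} {b} {c,g} {f}
    [0, 1, 2, 0, 3, 0],  # day 3: {a,g,d} {b} {c} {f}
]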

Background

In my case the clusters are on correlation matrices of financial instruments.

Best Answer

I suggest using either the Variation of Information or the split/join distance. Both are metric distances on the space of partitions: they are 0 for identical partitions and grow larger as the partitions become more different. Further information is available here:

Comparing clusterings: Rand Index vs Variation of Information
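For reference, here is a minimal, self-contained sketch of both measures, computed directly from two per-item label vectors. The function names and the label-vector representation are my own choices; the Variation of Information follows Meila's definition H(A|B) + H(B|A), and the split/join distance follows van Dongen's formulation.

import math
from collections import Counter

def variation_of_information(labels_a, labels_b):
    """Meila's Variation of Information between two partitions,
    each given as a sequence of per-item cluster labels.
    0.0 for identical partitions; never larger than log(n)."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # |A_i intersect B_j| counts
    size_a = Counter(labels_a)                # cluster sizes in A
    size_b = Counter(labels_b)                # cluster sizes in B
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # accumulate H(A|B) + H(B|A) term by term (natural log)
        vi -= p_ab * (math.log(n_ab / size_a[a]) + math.log(n_ab / size_b[b]))
    return vi

def split_join_distance(labels_a, labels_b):
    """van Dongen's split/join distance:
    2n - sum_i max_j |A_i intersect B_j| - sum_j max_i |A_i intersect B_j|."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    best_a = Counter()  # largest overlap found for each cluster of A
    best_b = Counter()  # largest overlap found for each cluster of B
    for (a, b), n_ab in joint.items():
        best_a[a] = max(best_a[a], n_ab)
        best_b[b] = max(best_b[b], n_ab)
    return 2 * n - sum(best_a.values()) - sum(best_b.values())

Both functions return 0 when the two label vectors describe the same partition and grow as the partitions diverge, so lower day-to-day values mean a more consistent method.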

There is no reason at all to use some pseudo-statistical measure when the space of partitions can be equipped with a metric distance (several, in fact). Be wary of measures that are strongly affected by cluster sizes, i.e. measures where moving a single node is weighted differently depending on the sizes of the clusters involved. The Rand index (and the associated Mirkin distance) is especially bad in this respect.
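As a usage sketch (the normalization by log(n) is my own addition, not part of the answer above): VI never exceeds log(n) for partitions of n items, so 1 - VI/log(n) gives the 1.0-to-0.0 scale asked for in the question, being exactly 1.0 for identical partitions and falling towards 0.0 as they become maximally different. Applied to method B from the question, using scipy and scikit-learn for the entropy and mutual information:

import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def vi(labels_a, labels_b):
    # VI = H(A) + H(B) - 2*I(A;B), all in nats
    h_a = entropy(np.bincount(labels_a))
    h_b = entropy(np.bincount(labels_b))
    return h_a + h_b - 2.0 * mutual_info_score(labels_a, labels_b)

# Method B from the question, items in the order (a, b, c, d, f, g)
days = [
    [0, 0, 0, 1, 1, 2],  # day 1: {a,b,c} {d,f} {g}
    [0, 0, 1, 1, 1, 2],  # day 2: {a,b} {c,d,f} {g}
    [0, 0, 1, 1, 2, 3],  # day 3: {a,b} {c,d} {f} {g}
]

n = len(days[0])
for t in range(len(days) - 1):
    d = vi(days[t], days[t + 1])
    # 1 - VI/log(n): 1.0 for identical consecutive days, towards 0.0
    # for maximally different ones
    print(f"day {t + 1} -> day {t + 2}: VI = {d:.3f}, "
          f"consistency = {1 - d / np.log(n):.3f}")

Averaging these per-day scores gives a single consistency figure per method; method A would score exactly 1.0, while methods B and C would score strictly less.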
