Clustering – How to Validate Clustering Results with Labeled Data

clustering, validation

I am working on a clustering algorithm and would like to validate its performance against a well-known and widely used dataset: the KDD-CUP 99 dataset (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). With this dataset, both unlabeled and labeled test data are provided. My question is, how should I validate my clustering algorithm's performance?

Let's say the results of my algorithm are as follows:
x1 -> cluster A
x2 -> cluster A
x3 -> cluster B
x4 -> cluster A

And let's say the labels provided are as follows:
x1 -> cluster 1
x2 -> cluster 1
x3 -> cluster 1
x4 -> cluster 2

Given that the cluster labels are completely different, how should I compare these? In this case, an obvious assumption would be that cluster A corresponds to cluster 1, but the correspondence may not always be so obvious. Is there any standardized way to evaluate such situations?
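The "cluster A is probably cluster 1" intuition can be made explicit by searching for the label correspondence that maximizes agreement. A minimal sketch using the Hungarian algorithm from SciPy (`scipy.optimize.linear_sum_assignment`); the cluster assignments are the ones from the example above, with A/B and 1/2 encoded as 0/1:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

pred = np.array([0, 0, 1, 0])  # algorithm output: A=0, B=1 for x1..x4
true = np.array([0, 0, 0, 1])  # provided labels: cluster 1=0, cluster 2=1

# Build the contingency (confusion) matrix: rows = predicted clusters,
# columns = true clusters, entry [i, j] = number of shared points.
k = max(pred.max(), true.max()) + 1
C = np.zeros((k, k), dtype=int)
for p, t in zip(pred, true):
    C[p, t] += 1

# Negate so the minimizing assignment maximizes the matched counts.
rows, cols = linear_sum_assignment(-C)
mapping = dict(zip(rows, cols))          # predicted cluster -> matched true label
accuracy = C[rows, cols].sum() / len(pred)
print(mapping, accuracy)
```

For this toy example any matching covers at most 2 of the 4 points (accuracy 0.5), which illustrates why a forced one-to-one relabeling can be brittle; the permutation-invariant indices discussed in the answer below avoid choosing a matching at all.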

Best Answer

Look into distances between clusterings. These measures are all computed from the confusion (contingency) matrix between the two clusterings, so they do not depend on how the clusters happen to be named. Well known are the Rand index and the adjusted Rand index, but I generally recommend either Variation of Information or the lesser-known split-join distance (see e.g. Comparing clusterings: Rand Index vs Variation of Information and How to interpret these indices/metrics for comparing partitions intuitively out of these images? for more discussion).
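As an illustration of the label-invariance, here is a short sketch using scikit-learn's `adjusted_rand_score` on the example from the question (A/B encoded as 0/1, cluster 1/2 as 0/1; the function name is from scikit-learn's real API, the encoding is mine):

```python
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

pred = [0, 0, 1, 0]  # algorithm output: x1,x2,x4 -> A, x3 -> B
true = [0, 0, 0, 1]  # provided labels:  x1,x2,x3 -> 1, x4 -> 2

# The confusion matrix between the two clusterings, from which the
# Rand-type indices are computed.
print(contingency_matrix(true, pred))

# Adjusted Rand index: 1.0 for identical partitions, ~0 for random
# labelings (can go negative for worse-than-chance agreement).
ari = adjusted_rand_score(true, pred)

# Renaming the predicted clusters does not change the score.
assert ari == adjusted_rand_score(true, [1 - p for p in pred])
```

Here the ARI is negative (about -0.33), i.e. the two partitions agree less than two random labelings would on average, which matches the intuition that only x1 and x2 are grouped consistently.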