Solved – Evaluation of Clustering method

clustering, model-evaluation

I'm confused about how to choose a method for evaluating different clustering techniques. In this paper, the authors follow this pipeline: use Hungarian assignment to match each cluster with a true label, then calculate F1, precision, and recall as in a classification problem. Other papers, however, use the Adjusted Rand Index (ARI).

I have tried both methods, and I found that in most cases ARI gives a higher score than the matching-based method. In one case in particular, ARI gives a score of 0.96, which is close to a perfect match, but F1 gives only 0.50, which suggests a very poor clustering.

Is there any 'official' comparison of the two methods, and which one is more reliable or more widely accepted?

Best Answer

You must not and cannot compare scores from evaluation method A to scores from evaluation method B.

In particular, ARI is adjusted for chance (a random result will score about 0), as opposed to the regular Rand index, which is simply the accuracy over pairs of points. Precision, recall, and F1 on a balanced k-class problem, by contrast, will be around 1/k for a random result. So a higher score in measure A does not mean the result is better than a lower score in measure B!
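To make the different baselines concrete, here is a minimal sketch in plain Python. The helper names (`pair_sums`, `rand_index`, `adjusted_rand_index`) and the toy setup (k = 4 classes, 2000 points, cluster ids drawn independently of the true labels) are assumptions for illustration, not anything from the papers in question. It computes the Rand index and the ARI from pair counts and compares them with the raw label agreement for a random "clustering".

```python
import random
from collections import Counter
from math import comb


def pair_sums(labels_true, labels_pred):
    """Count point pairs grouped together: in both labelings, in the truth,
    in the prediction, and the total number of pairs."""
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    return both, in_true, in_pred, comb(len(labels_true), 2)


def rand_index(labels_true, labels_pred):
    both, in_true, in_pred, total = pair_sums(labels_true, labels_pred)
    # Agreeing pairs: together in both labelings, or apart in both.
    return (total + 2 * both - in_true - in_pred) / total


def adjusted_rand_index(labels_true, labels_pred):
    both, in_true, in_pred, total = pair_sums(labels_true, labels_pred)
    expected = in_true * in_pred / total   # value expected by chance
    max_index = (in_true + in_pred) / 2    # value for a perfect match
    if max_index == expected:              # degenerate case, e.g. one cluster each
        return 1.0
    return (both - expected) / (max_index - expected)


# Hypothetical toy data: the "clustering" is unrelated to the true labels.
rng = random.Random(0)
k, n = 4, 2000
labels_true = [rng.randrange(k) for _ in range(n)]
labels_pred = [rng.randrange(k) for _ in range(n)]

ri = rand_index(labels_true, labels_pred)
ari = adjusted_rand_index(labels_true, labels_pred)
acc = sum(t == p for t, p in zip(labels_true, labels_pred)) / n

print(f"Rand index:          {ri:.3f}")   # well above 0 even for a random result
print(f"Adjusted Rand index: {ari:.3f}")  # close to 0
print(f"Raw label agreement: {acc:.3f}")  # close to 1/k
```

The three numbers describe the same random result yet sit on very different scales, which is why scores from different measures cannot be compared directly.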

The standard measures in the literature for evaluating clusterings with respect to existing labels (the usual caveats apply) are ARI and NMI (usually in the sqrt-normalized version, but AMI would probably be the better choice here, too).
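As a rough sketch of how these three measures can be computed, assuming scikit-learn is available: the toy label vectors below are made up for illustration, and `average_method="geometric"` is used here to get the sqrt-normalized NMI (scikit-learn's default normalization is arithmetic).

```python
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    normalized_mutual_info_score,
)

# Hypothetical toy data; cluster ids are arbitrary and need no matching step.
labels_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_pred = [1, 1, 1, 0, 0, 2, 2, 2, 2]

ari = adjusted_rand_score(labels_true, labels_pred)
# Geometric mean of the two entropies = the "sqrt" normalization of NMI.
nmi_sqrt = normalized_mutual_info_score(
    labels_true, labels_pred, average_method="geometric"
)
ami = adjusted_mutual_info_score(labels_true, labels_pred)

print(f"ARI:        {ari:.3f}")
print(f"NMI (sqrt): {nmi_sqrt:.3f}")
print(f"AMI:        {ami:.3f}")
```

None of these calls requires mapping clusters to classes first; they depend only on which points are grouped together.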

Matching with the Hungarian algorithm is possible, but it assumes even more strongly that there "must" be a 1:1 correspondence between clusters and classes. That assumption has repeatedly proven wrong: classes can have substructures that are not labeled. Matching may be reasonable if the application requires a precisely defined number of clusters (e.g., in signal transmission, when you know there must be 10 different signals on the wire; there won't be 11).
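The matching step can be sketched as follows, assuming NumPy and SciPy are available. The function name `hungarian_accuracy` and the toy data are hypothetical, and for brevity the sketch reports plain accuracy after matching rather than the precision/recall/F1 of the pipeline described in the question. The example splits class 0 into two clusters to show how the 1:1 assumption penalizes unlabeled substructure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def hungarian_accuracy(labels_true, labels_pred):
    """Match clusters to classes 1:1 so that the number of agreeing points
    is maximal, then return the mapping and the resulting accuracy."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)

    # contingency[i, j] = number of points in cluster i that belong to class j
    contingency = np.zeros((clusters.size, classes.size), dtype=int)
    for i, cluster in enumerate(clusters):
        for j, cls in enumerate(classes):
            contingency[i, j] = np.sum((labels_pred == cluster) & (labels_true == cls))

    # Optimal 1:1 assignment; surplus clusters (or classes) stay unmatched.
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    mapping = {int(clusters[r]): int(classes[c]) for r, c in zip(rows, cols)}
    accuracy = contingency[rows, cols].sum() / labels_true.size
    return mapping, accuracy


# Hypothetical toy data: class 0 has two sub-groups that were clustered separately.
labels_true = [0] * 6 + [1] * 4
labels_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]

mapping, acc = hungarian_accuracy(labels_true, labels_pred)
print(mapping)
print(f"accuracy after matching: {acc:.2f}")  # 0.70
```

Every cluster here is pure, yet only one of the two class-0 clusters can be matched, so three correctly grouped points count as errors.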

Furthermore, ARI and NMI are more robust to changes in the labeling. Relabeling a single point could make the Hungarian matching algorithm produce a very different cluster mapping. So I'd argue the ARI and NMI results are likely more reliable.
