Recently I have been working on a scientific paper about clustering, in which I use two extrinsic evaluation metrics to evaluate the clusterings: the $B^3 F$ score and $ARI$. The former seems much more informative than the latter and delivers remarkably higher values. On the other hand, $ARI$ seems to fit my dataset, in which $k$ is of the same order as $N$, where $k$ is the number of clusters (six on average) and $N$ the number of data points (20 on average).
However, my supervising professor suggests omitting $ARI$ because it is insensitive and thus uninformative; for instance, even supplying the true $k$ to the algorithms barely changes it. Overall, $ARI$ produces mediocre values of around $0.1$, which is quite low.
Upon reading about the limitations of $ARI$ and adjusted measures in general, it was brought to my attention that:
- The analytical formula of $ARI$ was derived under the assumption of a hypergeometric model of randomness, which imposes tight constraints that are almost never satisfied by the outputs of a clustering algorithm [1];
- Moreover, Meila [2] explained that all adjusted indices, including $ARI$, are non-local, meaning that a change in one cluster is weighted differently depending on how the remaining clusters are formed.
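For reference, the adjustment in question can be written out explicitly; the expected-index term below is exactly where the hypergeometric assumption (a contingency table with fixed marginals) enters:

$$
\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \Big[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Big]\Big/\binom{N}{2}}{\frac{1}{2}\Big[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\Big] - \Big[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Big]\Big/\binom{N}{2}},
$$

where $n_{ij}$ counts the points falling in cluster $i$ of one partition and cluster $j$ of the other, and $a_i = \sum_j n_{ij}$, $b_j = \sum_i n_{ij}$ are the marginals.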
So could you please elaborate on the situations in which using $ARI$ is not advisable?
[1] http://www.jmlr.org/papers/volume11/vinh10a/vinh10a.pdf
[2] https://www.stat.washington.edu/mmp/Papers/icml05-compare-axioms.pdf
Best Answer
First of all, ARI is one of the standard measures used everywhere. So if you omit it, there is a good chance reviewers will demand that you add it, and reject your paper. B³ is a fairly exotic measure that I have not seen anybody actually use.
It is meaningless that the values of measure A are higher than those of measure B: that is comparing apples and oranges, and some fairly simple transformations can change these values. For example, the Rand index (RI) will always be higher than ARI, even though they measure the same quantity, because ARI takes the RI relative to an expected value. So $B^3 > ARI$ is a useless observation; you must never compare the values of different measures. Do you actually observe different rankings? I.e., given two results R1, R2, do you observe $B^3(R1)\gg B^3(R2)$ but $ARI(R1)\ll ARI(R2)$ on the same data set?
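To see the "compare rankings, not raw values" point concretely, one can compute both pair-counting indices against the same ground truth. A minimal sketch in plain Python; the helper names and toy labelings R1/R2 are hypothetical, not taken from the paper:

```python
from collections import Counter


def comb2(n):
    """Number of unordered pairs among n items: C(n, 2)."""
    return n * (n - 1) // 2


def _pair_counts(labels_true, labels_pred):
    """Pair-counting sums derived from the contingency table."""
    n = len(labels_true)
    nij = Counter(zip(labels_true, labels_pred))  # contingency cells
    sum_ij = sum(comb2(c) for c in nij.values())
    sum_a = sum(comb2(c) for c in Counter(labels_true).values())   # row sums
    sum_b = sum(comb2(c) for c in Counter(labels_pred).values())   # column sums
    return n, sum_ij, sum_a, sum_b


def rand_index(labels_true, labels_pred):
    n, sum_ij, sum_a, sum_b = _pair_counts(labels_true, labels_pred)
    total = comb2(n)
    # agreements = pairs together in both partitions + pairs apart in both
    return (total + 2 * sum_ij - sum_a - sum_b) / total


def adjusted_rand_index(labels_true, labels_pred):
    n, sum_ij, sum_a, sum_b = _pair_counts(labels_true, labels_pred)
    expected = sum_a * sum_b / comb2(n)  # E[sum_ij] under the hypergeometric model
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)


truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
r1 = [0, 0, 1, 1, 1, 2, 2, 2, 2]  # hypothetical clustering result R1
r2 = [0, 1, 2, 0, 1, 2, 0, 1, 2]  # hypothetical clustering result R2

for name, r in [("R1", r1), ("R2", r2)]:
    print(name, "RI =", round(rand_index(truth, r), 3),
          "ARI =", round(adjusted_rand_index(truth, r), 3))
```

RI exceeds ARI for both results, as it must by construction; the only meaningful comparison here is the ranking of R1 versus R2 under each measure separately.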
Which brings us to why you should pay attention to ARI: if your result has an ARI close to 0 (such as 0.1), then randomly permuting all labels would yield a result that is almost as good. But then how can your result be any good? Does B³ perform any such adjustment? What is the B³ score of a random result?
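The random-permutation argument can be checked empirically on data of roughly the size in the question ($N = 20$, $k = 6$): averaging ARI over shuffled labelings gives values near 0, so an ARI of 0.1 sits barely above chance. A hedged sketch, using the standard pair-counting definition of ARI and toy labels:

```python
import random
from collections import Counter


def comb2(n):
    return n * (n - 1) // 2


def adjusted_rand_index(labels_true, labels_pred):
    """Standard pair-counting ARI from the contingency table."""
    n = len(labels_true)
    sum_ij = sum(comb2(c) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb2(c) for c in Counter(labels_true).values())
    sum_b = sum(comb2(c) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb2(n)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)


random.seed(0)
truth = [i % 6 for i in range(20)]  # 20 points, 6 clusters, as in the question
shuffled = truth[:]

scores = []
for _ in range(2000):
    random.shuffle(shuffled)  # destroy any real structure, keep cluster sizes
    scores.append(adjusted_rand_index(truth, shuffled))

mean_ari = sum(scores) / len(scores)
print("mean ARI of random labelings:", round(mean_ari, 3))  # close to 0
```

Because shuffling preserves the cluster sizes, this is exactly the fixed-marginals model the adjustment assumes, so the average lands near 0 by design; the same experiment run with B³ instead would answer the question of whether it has a comparable chance correction.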
It is correct that ARI is not perfect. But are the other measures actually better in this respect? I doubt it. Yes, the adjustment of ARI relies on a fairly simple randomness assumption (one that breaks down for constant labels), but it serves its purpose of making the Rand index more interpretable well. For some uncommon special cases, different adjustments may be desirable; in such cases, I would suggest taking the minimum over all adjustments, so that your score can only become worse.
Non-locality is an unfortunate property, but not a crucial one for most users. It is mostly relevant for hierarchical approaches: while you can locally evaluate the Rand index to check whether a split improves the result, you cannot find the best agreement with respect to ARI that easily. The reasonable approach in this situation is to maximize the Rand index locally, and then adjust the final result.
In conclusion, I suggest that you:

- keep ARI, since reviewers will expect this standard measure;
- compare rankings of results under each measure, rather than raw values across measures;
- check what B³ yields on randomly permuted labels, to see whether it offers a comparable chance adjustment.