Solved – When NOT to use the Adjusted Rand Index (ARI)

clustering

Recently I have been working on a scientific paper about clustering, in which I use two extrinsic evaluation metrics to evaluate the clusterings: the $B^3 F$ score and the $ARI$. The former is much more informative than the latter and yields markedly higher values. On the other hand, $ARI$ seems to fit my dataset, in which $k \approx N$, where $k$ is the number of clusters (six on average) and $N$ the number of data points (20 on average).

However, my supervising professor suggests omitting $ARI$ because it is insensitive and thus uninformative; for instance, even supplying the true $k$ to the algorithms barely changes it. Overall, $ARI$ produces mediocre results of around $0.1$, which is quite low.

Upon reading about the limitations of $ARI$ and adjusted measures in general, it was brought to my attention that:

  • The analytical formula of $ARI$ was derived under the assumption of a hypergeometric model of randomness, which imposes tight constraints that are almost never satisfied by the outputs of a clustering algorithm [1];
  • Moreover, Meila [2] explained that all adjusted indices, including $ARI$, are non-local, meaning that a variation in one cluster is weighted differently depending on how the remaining clusters are formed.

So could you please elaborate further on the situations in which using $ARI$ is not advisable?


[1] http://www.jmlr.org/papers/volume11/vinh10a/vinh10a.pdf
[2] https://www.stat.washington.edu/mmp/Papers/icml05-compare-axioms.pdf

Best Answer

First of all, ARI is one of the standard measures used everywhere. So if you omit it, there is a good chance reviewers will demand that you add it, and reject your paper. B³ is a fairly exotic measure that I have not seen anybody actually use.

It is meaningless that the values of measure A are higher than those of measure B: that is comparing apples and oranges, and fairly simple transformations can shift the values of either one. For example, the Rand index (RI) will always be at least as high as the ARI, despite the two measuring the same quantity, because the ARI takes the RI relative to an expected value. So B³>ARI is a useless observation; you must never compare raw values across different measures. Do you actually observe different rankings? I.e., given two results R1, R2, do you observe $B³(R1)\gg B³(R2)$ but $ARI(R1)\ll ARI(R2)$ on the same data set?
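To see the RI-versus-ARI relationship concretely, here is a small sketch (the two toy labelings are my own invention) using scikit-learn's `rand_score` and `adjusted_rand_score`:

```python
# Sketch with made-up labelings: on the same pair of partitions, the
# raw Rand index always sits at or above the chance-adjusted ARI, so
# comparing raw values across different measures tells you nothing.
from sklearn.metrics import rand_score, adjusted_rand_score

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred  = [0, 0, 1, 1, 1, 2, 2, 2, 2]

ri  = rand_score(truth, pred)            # not chance-corrected
ari = adjusted_rand_score(truth, pred)   # (RI - E[RI]) / (max - E[RI])
print(ri, ari)
assert ri >= ari
```

The assertion holds in general, not just here: subtracting the expected value and renormalizing can only pull the score down (to 0 for a random result).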

This brings us to the reason why you should pay attention to the ARI. If your result has an ARI close to 0 (such as 0.1), then randomly permuting all labels would score almost as well. How can the result be any good then? Does B³ make any such chance adjustment? What is the B³ of a random result?
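You can check the "random permutation scores about 0" claim empirically. A minimal sketch (the toy ground truth is my own):

```python
# Sketch: the expected ARI of a randomly shuffled labeling is ~0.
# That is precisely what the chance adjustment buys you.
import random
from sklearn.metrics import adjusted_rand_score

random.seed(0)
truth = [i // 5 for i in range(30)]   # toy ground truth: 6 clusters of 5
shuffled = truth[:]

scores = []
for _ in range(1000):
    random.shuffle(shuffled)          # destroy any real structure
    scores.append(adjusted_rand_score(truth, shuffled))

mean_ari = sum(scores) / len(scores)
print(mean_ari)   # close to 0; individual runs may even be negative
assert abs(mean_ari) < 0.05
```

An unadjusted measure run through the same experiment would average well above 0, which is why its absolute values look flattering.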

It is true that the ARI is not perfect. But are the other measures actually better in this respect? I doubt it. Yes, the adjustment of the ARI rests on a fairly simple null model (which breaks down for constant labelings), but it serves its purpose of making the Rand index more interpretable well. For some uncommon special cases, it may be desirable to use different adjustments. In such cases, I would suggest taking the minimum over all the adjustments, so that your score can only become worse.
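One way to sketch the "minimum over adjustments" idea is to report the worst of several chance-adjusted scores. Using ARI and AMI as the two adjusted indices here is my own illustrative choice (the answer speaks of different null-model adjustments; any set of adjusted scores could be plugged in):

```python
# Sketch (illustrative choice of indices): report the most conservative
# of several chance-adjusted scores, so the reported number can only be
# lower than any single adjustment.
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

def conservative_score(truth, pred):
    return min(adjusted_rand_score(truth, pred),
               adjusted_mutual_info_score(truth, pred))

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred  = [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(conservative_score(truth, pred))
```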

Non-locality is an unfortunate property, but not crucial for most users. It is mostly relevant for hierarchical approaches: while you can locally evaluate the Rand index to check whether a split improves the result, you cannot find the best agreement with respect to the ARI that easily. The reasonable approach in this situation is of course to maximize the Rand index, and then adjust the final result.

In conclusion, I suggest that you:

  1. Always give the standard ARI index. A low ARI does indicate a poor result.
  2. If you want, add additional adjustments with other null models.
  3. Give the NMI, because that is the other standard measure (it also has issues; in particular a high NMI does not guarantee a good result, because a random result can score high). Be explicit about which version of NMI you use!
  4. Give the AMI, the adjusted mutual information. This is a similar adjustment that aims at normalizing a random MI to be 0.
  5. Include trivial baselines, such as a random permutation of the true labels, all objects labeled 0, randomly subsampling k "centers" and assigning each object to its nearest "center", etc. If you can't substantially beat these baselines, your method is not good (and you will be surprised how common this is!)
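The trivial baselines from point 5 can be sketched as follows (the toy ground truth, toy 2-D points, and baseline function names are all my own; swap in your real data and your chosen metrics):

```python
# Sketch of three trivial baselines: random label permutation, a single
# all-zero cluster, and random k "centers" with nearest-center assignment.
# A proposed method should substantially beat all of these.
import random
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

random.seed(42)
truth  = [i // 5 for i in range(30)]                     # toy ground truth: 6 clusters
points = [(float(i), float(i % 7)) for i in range(30)]   # toy 2-D data

def baseline_permutation(labels):
    out = labels[:]
    random.shuffle(out)
    return out

def baseline_all_zero(labels):
    return [0] * len(labels)

def baseline_random_centers(pts, k):
    centers = random.sample(pts, k)
    def nearest(p):
        return min(range(k),
                   key=lambda j: (p[0] - centers[j][0]) ** 2
                                 + (p[1] - centers[j][1]) ** 2)
    return [nearest(p) for p in pts]

for name, pred in [("permutation",    baseline_permutation(truth)),
                   ("all-zero",       baseline_all_zero(truth)),
                   ("random centers", baseline_random_centers(points, 6))]:
    print(name, adjusted_rand_score(truth, pred),
          adjusted_mutual_info_score(truth, pred))
```

The all-zero baseline scores exactly 0 under the ARI; an unadjusted measure that gives it (or the permutation baseline) a flattering score is telling you something about that measure.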