Recently I have been working on a scientific paper about clustering, in which I use two extrinsic evaluation metrics to evaluate the clusterings: the $B^3 F$ score and $ARI$. The former seems much more informative than the latter and delivers remarkably higher values. On the other hand, $ARI$ seems to fit my dataset, in which $k$ is of the same order as $N$, where $k$ is the number of clusters (six on average) and $N$ the number of data points (20 on average).
However, my supervising professor suggests omitting $ARI$ because it is insensitive and thus uninformative; for instance, even supplying the true $k$ to the algorithms barely changes it. Overall, $ARI$ produces mediocre values of around $0.1$, which is quite low.
Upon reading about the limitations of $ARI$ and adjusted measures in general, it was brought to my attention that:
- The analytical formula of $ARI$ was derived under the assumption of a hypergeometric model of randomness, which imposes tight constraints that are almost never satisfied by the outputs of a clustering algorithm [1];
- Moreover, Meila [2] explained that all adjusted indices, including $ARI$, are non-local, meaning that a change in one cluster is weighted differently depending on how the remaining clusters are formed.
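For reference, the adjustment in question can be written out explicitly; the expected-index term below is exactly where the hypergeometric assumption (a contingency table with fixed marginals) enters:

$$
\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \Big[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Big]\Big/\binom{N}{2}}{\frac{1}{2}\Big[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\Big] - \Big[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Big]\Big/\binom{N}{2}},
$$

where $n_{ij}$ counts the points falling in cluster $i$ of one partition and cluster $j$ of the other, and $a_i = \sum_j n_{ij}$, $b_j = \sum_i n_{ij}$ are the marginals.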
So could you please elaborate on the situations in which using $ARI$ is not advisable?
[1] http://www.jmlr.org/papers/volume11/vinh10a/vinh10a.pdf
[2] https://www.stat.washington.edu/mmp/Papers/icml05-compare-axioms.pdf
Best Answer
First of all, ARI is one of the standard measures used everywhere. So if you omit it, there is a good chance reviewers will demand that you add it, and reject your paper. B³ is a fairly exotic measure that I have not seen anybody actually use.
It is meaningless that the values of measure A are higher than those of measure B: that is comparing apples and oranges, and some fairly simple transformations can change these values. For example, the Rand index (RI) will always be higher than ARI, even though they measure the same quantity, because ARI takes the RI relative to an expected value. So $B^3 > ARI$ is a useless observation; you must never compare the values of different measures. Do you actually observe different rankings? I.e., given two results R1, R2, do you observe $B^3(R1)\gg B^3(R2)$ but $ARI(R1)\ll ARI(R2)$ on the same data set?
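To see the "compare rankings, not raw values" point concretely, one can compute both pair-counting indices against the same ground truth. A minimal sketch in plain Python; the helper names and toy labelings R1/R2 are hypothetical, not taken from the paper:

```python
from collections import Counter


def comb2(n):
    """Number of unordered pairs among n items: C(n, 2)."""
    return n * (n - 1) // 2


def _pair_counts(labels_true, labels_pred):
    """Pair-counting sums derived from the contingency table."""
    n = len(labels_true)
    nij = Counter(zip(labels_true, labels_pred))  # contingency cells
    sum_ij = sum(comb2(c) for c in nij.values())
    sum_a = sum(comb2(c) for c in Counter(labels_true).values())   # row sums
    sum_b = sum(comb2(c) for c in Counter(labels_pred).values())   # column sums
    return n, sum_ij, sum_a, sum_b


def rand_index(labels_true, labels_pred):
    n, sum_ij, sum_a, sum_b = _pair_counts(labels_true, labels_pred)
    total = comb2(n)
    # agreements = pairs together in both partitions + pairs apart in both
    return (total + 2 * sum_ij - sum_a - sum_b) / total


def adjusted_rand_index(labels_true, labels_pred):
    n, sum_ij, sum_a, sum_b = _pair_counts(labels_true, labels_pred)
    expected = sum_a * sum_b / comb2(n)  # E[sum_ij] under the hypergeometric model
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)


truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
r1 = [0, 0, 1, 1, 1, 2, 2, 2, 2]  # hypothetical clustering result R1
r2 = [0, 1, 2, 0, 1, 2, 0, 1, 2]  # hypothetical clustering result R2

for name, r in [("R1", r1), ("R2", r2)]:
    print(name, "RI =", round(rand_index(truth, r), 3),
          "ARI =", round(adjusted_rand_index(truth, r), 3))
```

RI exceeds ARI for both results, as it must by construction; the only meaningful comparison here is the ranking of R1 versus R2 under each measure separately.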
Which brings us to why you should pay attention to ARI: if your result has an ARI close to 0 (such as 0.1), then randomly permuting all labels would yield a result that is almost as good. But then how can your result be any good? Does B³ perform any such adjustment? What is the B³ score of a random result?
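The random-permutation argument can be checked empirically on data of roughly the size in the question ($N = 20$, $k = 6$): averaging ARI over shuffled labelings gives values near 0, so an ARI of 0.1 sits barely above chance. A hedged sketch, using the standard pair-counting definition of ARI and toy labels:

```python
import random
from collections import Counter


def comb2(n):
    return n * (n - 1) // 2


def adjusted_rand_index(labels_true, labels_pred):
    """Standard pair-counting ARI from the contingency table."""
    n = len(labels_true)
    sum_ij = sum(comb2(c) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb2(c) for c in Counter(labels_true).values())
    sum_b = sum(comb2(c) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb2(n)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)


random.seed(0)
truth = [i % 6 for i in range(20)]  # 20 points, 6 clusters, as in the question
shuffled = truth[:]

scores = []
for _ in range(2000):
    random.shuffle(shuffled)  # destroy any real structure, keep cluster sizes
    scores.append(adjusted_rand_index(truth, shuffled))

mean_ari = sum(scores) / len(scores)
print("mean ARI of random labelings:", round(mean_ari, 3))  # close to 0
```

Because shuffling preserves the cluster sizes, this is exactly the fixed-marginals model the adjustment assumes, so the average lands near 0 by design; the same experiment run with B³ instead would answer the question of whether it has a comparable chance correction.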
It is correct that ARI is not perfect. But are the other measures actually better in this respect? I doubt it. Yes, the adjustment of ARI relies on a fairly simple randomness assumption (one that breaks down for constant labels), but it serves its purpose of making the Rand index more interpretable well. For some uncommon special cases, different adjustments may be desirable; in such cases, I would suggest taking the minimum over all adjustments, so that your score can only become worse.
Non-locality is an unfortunate property, but not a crucial one for most users. It is mostly relevant for hierarchical approaches: while you can locally evaluate the Rand index to check whether a split improves the result, you cannot find the best agreement with respect to ARI that easily. The reasonable approach in this situation is to maximize the Rand index locally, and then adjust the final result.
In conclusion, I suggest that you:

- keep ARI, since reviewers will expect this standard measure;
- compare rankings of results under each measure, rather than raw values across measures;
- check what B³ yields on randomly permuted labels, to see whether it offers a comparable chance adjustment.