I have a dataset $X$. Each sample of $X$ has a label $y$, and these labels induce a partition $P$ of $X$ into $k$ subsets.
If I feed $X$ to a clustering algorithm and ask for $k$ clusters, I would like to obtain a partition of the samples of $X$ that is the same as the one induced by $y$, that is, $P$.
I want to compare the partition generated by the clustering algorithm with the ground-truth partition $P$.
To do this, I cannot directly compare the label $y$ of a sample with its cluster code, as unfortunately they are totally mismatched (the cluster code assignment is totally arbitrary).
Is there any known technique to perform this task?
Best Answer
The Adjusted Rand index could work. It's a popular method for measuring the similarity of two ways of assigning discrete labels to the data, ignoring permutations of the labels themselves. Instead of checking whether the raw class/cluster labels match, you'd look at pairs of points and ask: to what extent are pairs in the same class assigned to the same cluster, and pairs in different classes assigned to different clusters?
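To compute the Rand index, you'd measure $a$, the number of pairs of points that are in the same class and assigned to the same cluster, and $b$, the number of pairs that are in different classes and assigned to different clusters.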
The raw Rand index is:
$$RI = \frac{a + b}{\binom{n}{2}}$$
where $\binom{n}{2}$ is the number of possible pairs of points. $RI$ ranges from 0 to 1, with 1 indicating total agreement.
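As a rough illustration of the formula, here is a minimal brute-force sketch in plain Python. The function name `rand_index` and the toy labels are made up for this example; $a$ is taken to be the number of pairs in the same class and the same cluster, and $b$ the number of pairs in different classes and different clusters. It loops over all pairs, so it is only meant for small inputs.

```python
from itertools import combinations
from math import comb


def rand_index(labels_true, labels_pred):
    """Raw Rand index by brute-force pair counting (hypothetical helper, O(n^2))."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two samples to form a pair")

    a = 0  # pairs in the same class AND the same cluster
    b = 0  # pairs in different classes AND different clusters
    for i, j in combinations(range(n), 2):
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_class and same_cluster:
            a += 1
        elif not same_class and not same_cluster:
            b += 1
    return (a + b) / comb(n, 2)


# Toy data (invented): the same partition, but with different label names.
y = [0, 0, 1, 1, 2, 2]
clusters = ["b", "b", "c", "c", "a", "a"]
print(rand_index(y, clusters))  # 1.0
```

Because only pair memberships are compared, the names used for classes and clusters never need to match.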
However, a random assignment of labels generally wouldn't produce a Rand index of zero, because some pairs agree by chance alone. Therefore, it's better to use the adjusted Rand index (ARI), which corrects for chance and makes this type of null result easier to identify. ARI is at most 1 and can be negative: values near zero indicate chance-level labelings, negative values indicate agreement worse than chance, positive values indicate similar labelings, and 1 indicates perfect agreement.
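For a concrete feel of the chance correction, here is a small sketch of the ARI using the usual contingency-table (Hubert and Arabie) form, written in plain Python. The function name `adjusted_rand_index` and the toy labels are assumptions for illustration, not part of any library; in practice a library implementation is probably the better choice.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the class/cluster contingency table (hypothetical helper)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two samples to form a pair")

    # Counts of samples per (class, cluster) cell, per class, and per cluster.
    cells = Counter(zip(labels_true, labels_pred))
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)

    # Number of same-group pairs within cells, classes, and clusters.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_classes = sum(comb(c, 2) for c in class_sizes.values())
    sum_clusters = sum(comb(c, 2) for c in cluster_sizes.values())

    expected = sum_classes * sum_clusters / comb(n, 2)  # chance-level value
    max_index = (sum_classes + sum_clusters) / 2        # value at perfect agreement
    if max_index == expected:
        # Degenerate case: both partitions are a single group or all singletons.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy data (invented): identical partition with permuted label names.
print(adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"]))  # 1.0
```

On the small example `[0, 0, 1, 1]` versus `[0, 1, 0, 1]`, where no pair of points agrees, this sketch gives a negative value, which is how ARI flags labelings that are worse than chance.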
Other clustering performance metrics are also worth a look. The ones that might be useful to you compare cluster assignments to ground-truth labels (i.e. your class labels): normalized/adjusted mutual information, homogeneity/completeness/V-measure, and the Fowlkes-Mallows score.