Solved – Comparing a clustering algorithm partition to a “ground truth” one

clusteringvalidation

I have a dataset $X$. Each sample of $X$ has a label $y$, and these labels induce a partition $P$ of $X$ into $k$ subsets.

If I feed a clustering algorithm with $X$, asking for $k$ clusters, I would like to obtain a partition of the samples of $X$ that is the same as the one induced by $y$, that is, $P$.

I want to compare the partition generated by the clustering algorithm with the ground-truth partition $P$.

To do this, I cannot directly compare the labels $y$ with the cluster codes assigned to the samples: the cluster codes are assigned arbitrarily, so they generally do not match the labels even when the two partitions are identical.

Is there any known technique to perform this task?

Best Answer

The Adjusted Rand index could work. It's a popular method for measuring the similarity of two ways of assigning discrete labels to the data, ignoring permutations of the labels themselves. Instead of checking whether the raw class/cluster labels match, you'd look at pairs of points and ask: to what extent are pairs in the same class assigned to the same cluster, and pairs in different classes assigned to different clusters?

To compute the Rand index, you'd measure:

  • $a$ = Number of pairs that have the same class label and same cluster assignment
  • $b$ = Number of pairs that have different class labels and different cluster assignments

The raw Rand index is:

$$RI = \frac{a + b}{\binom{n}{2}}$$

where $\binom{n}{2}$ is the number of possible pairs of points. $RI$ ranges from 0 to 1, with 1 indicating total agreement.
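As a minimal sketch of the pair counting described above, the raw Rand index can be computed directly from two label lists. The function name `rand_index` and the argument names are illustrative choices, not taken from any library; the loop is quadratic in the number of samples, so it is only meant for small inputs.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Raw Rand index: fraction of point pairs on which the two labelings agree."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_true)
    if n < 2:
        raise ValueError("at least two samples are needed to form a pair")

    agreeing_pairs = 0  # this accumulates a + b from the definition above
    total_pairs = 0     # this ends up equal to n choose 2
    for i, j in combinations(range(n), 2):
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        # A pair agrees if it is together in both labelings (a)
        # or separated in both labelings (b).
        if same_class == same_cluster:
            agreeing_pairs += 1
        total_pairs += 1
    return agreeing_pairs / total_pairs


# The label values differ, but the partitions are identical.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Only equality of labels within each list is tested, never equality across the two lists, which is why a permutation of the cluster codes does not affect the result.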

However, a random assignment of labels generally won't produce a Rand index of zero, because many pairs agree by chance alone. Therefore, it's better to use the adjusted Rand index (ARI), which corrects for chance and makes it easier to identify this type of null result. ARI is at most 1, with 1 indicating perfect agreement. Its expected value is 0 for random labelings, so near-zero values indicate chance-level agreement, negative values indicate agreement worse than chance, and positive values indicate similar labelings.
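For illustration, here is a sketch of the ARI using the standard contingency-table form of the chance correction (the pair-counting index minus its expected value, divided by the maximum index minus the expected value). The function name `adjusted_rand_index` is a hypothetical helper written for this example; in practice, scikit-learn provides `sklearn.metrics.adjusted_rand_score` for the same purpose.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the class/cluster contingency counts."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_true)
    if n < 2:
        raise ValueError("at least two samples are needed to form a pair")

    # Number of samples for each (class, cluster) combination,
    # and the sizes of each class and each cluster.
    contingency = Counter(zip(labels_true, labels_pred))
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)

    # Pairs that are together in both labelings, in the classes, in the clusters.
    pairs_both = sum(comb(c, 2) for c in contingency.values())
    pairs_classes = sum(comb(c, 2) for c in class_sizes.values())
    pairs_clusters = sum(comb(c, 2) for c in cluster_sizes.values())

    expected = pairs_classes * pairs_clusters / comb(n, 2)
    maximum = 0.5 * (pairs_classes + pairs_clusters)
    if maximum == expected:
        # Degenerate case: both partitions are a single block or all
        # singletons, so they are identical.
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # discordant partitions
```

On these four points, the first call compares identical partitions with permuted codes and gives 1, while the second compares partitions that split every class across clusters and gives a negative value.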

You can also take a look at other clustering performance metrics. The ones that might be useful to you are those that compare cluster assignments to ground truth labels (i.e. your class labels): normalized/adjusted mutual information, homogeneity/completeness/V-measure, and the Fowlkes-Mallows score.
