Testing whether two datasets cluster similarly

Tags: clustering, hierarchical clustering

Most questions about cluster analysis seem to come from people who have a single dataset and want to compare/quantify the similarity of different clustering approaches. This question is not that. Instead, my goal is to take two separate datasets, apply the same clustering technique, and then compare/quantify the similarity of the resulting clusters.

I'll make it a bit more concrete. Let's say I'm taking a survey, and I want to use hierarchical cluster analysis to reveal the latent groupings in the data. Survey A takes an hour to administer, whereas Survey B takes 5 minutes to administer. I can probably assume that the data obtained from Survey A are a better estimate of the real world, but I want to know how well Survey B stacks up. Clearly the actual numbers are going to differ, but if both surveys yield the same clusters, then it's probably better to just use the shorter one.

So the big question is: what's an appropriate metric for measuring how different two sets of clusters are? I've had a quick read through Comparing Clusters – An Overview, and my first sub-question is whether there's been subsequent development since this paper was written (2007). They tentatively advocate for measures based on mutual information (section 5), but then caution that it's not well worked-out. My second sub-question is whether it's even appropriate to apply these methods to clusters that are based on different datasets.
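(For concreteness, mutual-information measures of this kind are readily available today; here is a minimal toy sketch using scikit-learn, with two hypothetical labelings of the same ten points:)

```python
# Toy example: two cluster labelings of the same ten points, compared with
# mutual-information scores. The labelings below are made up for illustration.
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

labels_a = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]  # one clustering of the points
labels_b = [0, 0, 1, 1, 1, 1, 2, 2, 2, 0]  # a second clustering to compare

print(adjusted_mutual_info_score(labels_a, labels_b))    # chance-corrected, ~0 if random
print(normalized_mutual_info_score(labels_a, labels_b))  # scaled to [0, 1]
```

Note that both labelings must refer to the same set of items, which is exactly what makes the cross-dataset case non-obvious.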

Best Answer

One problem with your scenario, which is unrelated to clustering, is that you want to show that your results are similar, and that is always harder than showing they are different. Most statistical tools are designed to detect differences larger than some threshold, and they make a poor argument when you need to claim the opposite (this is essentially the difficulty of equivalence testing).

Here's one idea, though. Say you study your first dataset and decide that it seems to have 3 clusters. You justify this selection properly, using one of the standard validity tests. Then you run k-means (or EM) clustering without randomization, i.e., with a fixed initialization, and assign each point of dataset 1 to a cluster.
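A minimal sketch of this step in Python, assuming k-means from scikit-learn as the clustering algorithm and the silhouette score as the validity test; `X1` here is a hypothetical placeholder for dataset 1:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X1 = rng.normal(size=(300, 2))  # placeholder for dataset 1

# Justify the number of clusters with a validity index (silhouette here).
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X1)
    print(k, silhouette_score(X1, km.labels_))

# Suppose k = 3 wins; fit once with a fixed seed so the result is reproducible.
base = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X1)
labels_1 = base.labels_            # cluster assignment of each dataset-1 point
centers_1 = base.cluster_centers_  # kept for re-initializing later runs
```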

Then you start augmenting your dataset: add points from dataset 2 to it, one at a time, and run the same clustering algorithm each time. A single point will not skew the clusters much, so in effect you label dataset 2 using the clusters defined by dataset 1.
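Continuing the sketch above, one way to code this: refit k-means on dataset 1 plus a single dataset-2 point, initialized at the dataset-1 centroids so cluster identities stay aligned across runs (`X2` is again a hypothetical placeholder):

```python
X2 = rng.normal(loc=0.5, size=(100, 2))  # placeholder for dataset 2

labels_2_via_1 = np.empty(len(X2), dtype=int)
for i, x in enumerate(X2):
    # Append one dataset-2 point; initializing at the dataset-1 centroids
    # means a single extra point cannot relabel the clusters.
    km = KMeans(n_clusters=3, init=centers_1, n_init=1).fit(np.vstack([X1, x]))
    labels_2_via_1[i] = km.labels_[-1]  # label of the appended point
```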

Then you run the same clustering algorithm on dataset 2 alone. Now, for each point, you know which of the 3 clusters it was assigned to based on dataset 1, and which one based on dataset 2 alone. You can calculate the share of points that land in matching clusters, which is already telling. You can also calculate a p-value for the hypothesis that this overlap is random (though most probably it won't be random unless your clusters are really spurious, so this test is not too useful).
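Continuing the sketch: the cluster labels from the two runs are arbitrary, so match them one-to-one first (the Hungarian algorithm on the contingency table does this), then compute the overlap share and a permutation p-value:

```python
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

labels_2_alone = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)

# Find the best one-to-one matching between the two sets of labels.
C = confusion_matrix(labels_2_via_1, labels_2_alone)
row, col = linear_sum_assignment(-C)        # maximize agreement
overlap = C[row, col].sum() / len(X2)
print("share of points in matching clusters:", overlap)

# Permutation p-value for "the two labelings agree only by chance".
n_perm, hits = 999, 0
for _ in range(n_perm):
    perm = rng.permutation(labels_2_alone)
    Cp = confusion_matrix(labels_2_via_1, perm)
    r, c = linear_sum_assignment(-Cp)
    hits += Cp[r, c].sum() / len(X2) >= overlap
print("p-value:", (hits + 1) / (n_perm + 1))
```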

Arguing that the difference is small is harder. In principle, you have to define how you expect your data to differ, if it differs at all, and then specify the biggest difference you are willing to tolerate. Then somehow generate a set of surrogate datasets that differ by exactly that "critical difference", and see how likely you are to obtain the actually observed overlap between the dataset-2 and dataset-1 clusters on such a dataset. But to do this, first and foremost, you need a good, well-thought-through null hypothesis $H_0$.
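Here is a sketch, continuing the code above, of what such a surrogate test might look like under one deliberately arbitrary $H_0$: each dataset-2 cluster drifts rigidly by some amount `delta`, standing in for the "critical difference". A real analysis would need an $H_0$ grounded in how your surveys could actually differ.

```python
def overlap_share(la, lb):
    """Best-matched agreement between two labelings of the same points."""
    M = confusion_matrix(la, lb)
    r, c = linear_sum_assignment(-M)
    return M[r, c].sum() / len(la)

delta = 1.0  # the "critical difference" you are willing to tolerate (assumption)
surrogate = []
for _ in range(200):
    # One arbitrary H0: every dataset-2 cluster drifts rigidly by delta in a
    # random direction. Your own H0 must reflect how YOUR data could differ.
    X2s = X2.copy()
    for k in range(3):
        d = rng.normal(size=2)
        X2s[labels_2_alone == k] += delta * d / np.linalg.norm(d)
    via_1 = base.predict(X2s)  # cheap stand-in for the point-at-a-time labeling
    alone = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2s)
    surrogate.append(overlap_share(via_1, alone))

# If datasets that truly differ by delta rarely reach the observed overlap,
# that is an argument that yours are not "critically different".
print("p under 'critically different':", np.mean(np.array(surrogate) >= overlap))
```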

You can use other sorts of measurements as well: for example, you could compute the average separation (the mean of all pairwise distances) between the points of "matching clusters" in datasets 1 and 2, or you could use entropy-based approaches. But regardless of the approach, showing that the clusters are different is easy, while showing that they are "essentially the same" requires additional assumptions and a good, solid $H_0$.
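Both alternatives are short to compute on the running example: the mean pairwise distance between points of matched clusters across the two datasets, and an entropy-based score (here adjusted mutual information from scikit-learn, applied to the two labelings of dataset 2's points):

```python
from scipy.spatial.distance import cdist
from sklearn.metrics import adjusted_mutual_info_score

# Mean of all pairwise distances between points of "matching clusters":
# dataset-1 cluster r and its matched dataset-2 cluster c, using the
# Hungarian matching (row, col) computed earlier.
for r, c in zip(row, col):
    d = cdist(X1[labels_1 == r], X2[labels_2_alone == c])
    print(f"cluster {r} <-> {c}: mean separation {d.mean():.3f}")

# Entropy-based alternative on the two labelings of dataset 2's points.
print("AMI:", adjusted_mutual_info_score(labels_2_via_1, labels_2_alone))
```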
