Solved – Comparing results of unsupervised clustering to a known classification

Tags: classification, clustering, neuroscience

Disclaimer: I'm looking for a bit of help as I'm only a simple neuroscientist and even working out what to google in this area is a tricky prospect. Here goes:

I have a set of data (3D positions in the brain). Each position can be allocated to a known brain area. What I want to know is – do these positions cluster nicely into previously described brain regions, or not?

As an analogy: Imagine you have the longitude and latitude of every home in Europe. You want to understand where people live. You can simply look up the country or state/county/district within a country in which any given home is located.

If you run a cluster analysis, you'll find clusters like London that correspond entirely to one country – UK – but to multiple counties within the country – Essex, Hertfordshire etc. The city of Basel is nominally a Swiss city, but with suburbs in France and Germany. So in these cases, the cluster (the city) won't correspond well to a single classification (the county or the country, respectively). In contrast, a city such as Bath is located in the UK, and also entirely within one subregion – Somerset.

I'm looking for a way to quantify this discordance. To be clear, I don't want to train a supervised ML algorithm to recapitulate the classification, but rather to find out how an unsupervised clustering matches up.

Thanks

Best Answer

I'm a machine learning scientist turned neuroscientist, so hopefully we'll be able to sort something out. There are basically two options here:

Option A: Direct cluster similarity estimation

There are some algorithms that can give you a direct similarity measure between two clusterings of the same data (the real anatomical regions on one side, the outcome of your unsupervised algorithm on the other). With this option you wouldn't know which region each cluster corresponds to, but you would get an absolute measure of similarity.

There are several approaches, but a simple one is to just calculate the mutual information between them -- the higher it is, the more similar the clusterings are.
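
To make the mutual information idea concrete, here is a minimal sketch in plain Python. The function names and the toy labels are my own illustration rather than anything from the papers, and the normalisation (dividing by the mean of the two entropies) is just one common convention I'm assuming, chosen so that the score does not depend on how many points you have.

```python
from collections import Counter
from math import log


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written with raw counts
        mi += (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


def entropy(labels):
    """Shannon entropy (in nats) of a single labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(labels_a, labels_b):
    """MI divided by the mean of the two entropies (assumed convention)."""
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both labelings put every point in a single group
    return 2.0 * mutual_information(labels_a, labels_b) / (h_a + h_b)


# Hypothetical toy data: one anatomical region and one cluster id per point.
regions = ["V1", "V1", "V1", "M1", "M1", "M1"]
clusters = [0, 0, 0, 1, 1, 1]
print(normalized_mutual_information(regions, clusters))
```

Note that the cluster ids never need to be matched to region names: the score only depends on how the points are grouped, which is exactly the property described above.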

Here are a couple of papers: this one with some simple and effective methods and this one with a review and comparison of several approaches.

Option B: Classification via clustering

Alternatively, you can split the process in two parts: 1) find a mapping between your true labels and your unsupervised cluster memberships; and 2) calculate how well those match, as in a standard classification evaluation. The advantage of this option is that you get a better grasp of what your unsupervised algorithm is doing; the disadvantage is that it's not as principled as the end-to-end solutions from option A.

Let's look at (2) first. There are piles of literature published on this which I can't possibly fit in a SE answer. I'll point you to the relevant Wikipedia section and informally suggest the Rand index as a reasonable candidate, but of course there are many more.
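
As a rough sketch of what the Rand index computes: it looks at every pair of points and counts the pairs on which the two labelings agree, i.e. both put the pair in the same group or both put it in different groups. The function name and toy data below are mine, purely for illustration. Because it works on pairs, it happens not to need the mapping from step (1) at all.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree (0 to 1)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two points to form a pair")
    agree = 0
    total = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair counts as agreement if it is together in both labelings
        # or apart in both labelings.
        if same_a == same_b:
            agree += 1
        total += 1
    return agree / total


# Hypothetical toy data: the clustering splits each region in an awkward way.
regions = ["V1", "V1", "V1", "M1", "M1", "M1"]
clusters = [0, 0, 1, 1, 2, 2]
print(rand_index(regions, clusters))
```

This loop is quadratic in the number of points, so it is only meant for small datasets. The plain Rand index also tends to be high even for unrelated labelings when there are many groups; the adjusted Rand index is the usual chance-corrected variant.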

Now back to (1). If you can afford it (i.e. you're handling a relatively small number of categories), the exhaustive brute force approach is to just try all possible combinations and pick the one that maximises your metric of choice from step (2). If that's too expensive, you can do something simpler like majority voting: For each cluster your unsupervised algorithm spits out, pick the label that is most highly represented in that cluster and assign that label to the whole cluster.
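
Both mapping strategies can be sketched in a few lines of Python. Everything here is illustrative: the function names and toy data are hypothetical, plain accuracy stands in for "your metric of choice", and for the brute force version I'm assuming you want a one-to-one mapping (each cluster gets a different label), which is only one way to read "all possible combinations".

```python
from collections import Counter, defaultdict
from itertools import permutations


def majority_vote_mapping(true_labels, cluster_ids):
    """Map each cluster to the true label that occurs most often inside it."""
    members = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster][label] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in members.items()}


def accuracy_under_mapping(true_labels, cluster_ids, mapping):
    """Fraction of points whose mapped cluster label equals the true label."""
    hits = sum(mapping.get(c) == t for t, c in zip(true_labels, cluster_ids))
    return hits / len(true_labels)


def best_one_to_one_mapping(true_labels, cluster_ids):
    """Exhaustive search over one-to-one mappings (factorial cost)."""
    clusters = list(dict.fromkeys(cluster_ids))  # unique ids, original order
    labels = list(dict.fromkeys(true_labels))
    if len(clusters) > len(labels):
        raise ValueError("more clusters than labels: no one-to-one mapping")
    best_mapping, best_score = None, -1.0
    for chosen in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, chosen))
        score = accuracy_under_mapping(true_labels, cluster_ids, mapping)
        if score > best_score:
            best_mapping, best_score = mapping, score
    return best_mapping, best_score


# Hypothetical toy data: one V1 point ends up in the mostly-M1 cluster.
regions = ["V1", "V1", "V1", "V1", "M1", "M1", "M1", "S1", "S1"]
clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2]

vote = majority_vote_mapping(regions, clusters)
print(vote, accuracy_under_mapping(regions, clusters, vote))
print(best_one_to_one_mapping(regions, clusters))
```

Majority voting allows several clusters to share one label, so it never scores worse than the one-to-one search on accuracy, but it can hide the fact that a region has been split into several clusters. Ties within a cluster are broken arbitrarily here.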

Of course there are plenty of variables and constraints that can heavily alter the problem, such as whether your unsupervised clustering algorithm gives you probabilities or "hard" cluster assignments, or whether you want to enforce that each label be represented by exactly one cluster. Hopefully you now have a few more keywords you can try searching to find what you need.