Solved – Adjusted Rand Index with different size inputs

Tags: clustering, partitioning, python

Forgive me if my terminology is off, this concept is somewhat new to me.

I'm trying to run an ARI calculation between two partitions which contain a different number of clusters, where each cluster can also vary in size. I'm not really sure what the correct approach is for running the calculation.

A bit of background…

I am running this in python and I have two lists:

List1 = [[41, 42], [145, 146, 206], [155, 216], [208, 209]...] etc

List2 = [[637, 698, 759], [696, 757, 818], [811, 872], [881, 942, 941, 1003, 943, 944, 1005, 1004, 1002]...] etc

len(List1) = 152
len(List2) = 106

The numbers in each sublist essentially correspond to grid cell IDs, where the IDs for List1 and List2 come from the same grid with dimensions (90x61). Some clusters may therefore contain similar groupings of IDs, which is why I want to compute an ARI.

The problem I have is that the lists only represent a sub-sample of the entire grid and the length of these lists differs. List1 and List2 contain 1378 and 817 points respectively out of the total 5490 available.

Now because the lengths are different, unfortunately I cannot just run the following:

from sklearn.metrics.cluster import adjusted_rand_score
ARI = adjusted_rand_score(List1,List2)

As I get an error:

labels_true and labels_pred must have same size, got 152 and 106

So my Question:

What would be the most mathematically sound approach to make List1 and List2 the same size for the ARI calculation? Would making 46 more dummy clusters in List2, filled with arbitrary grid cells from the main grid which aren't yet in either list be silly?

Many thanks for your help!

Best Answer

The sklearn implementation does not expect sets of clusters; it expects an array of labels, one entry per point. This is in line with the classification-oriented API of sklearn.
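For illustration, here is a minimal sketch of the label-array format `adjusted_rand_score` expects (the labels themselves are made up): position i in each array is the cluster label of point i, and the score is invariant to a permutation of the label values.

```python
from sklearn.metrics.cluster import adjusted_rand_score

# Same clustering of four points, just with the label values swapped,
# so the ARI is exactly 1.0.
labels_true = [0, 0, 1, 1]
labels_pred = [1, 1, 0, 0]

print(adjusted_rand_score(labels_true, labels_pred))  # 1.0
```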

Furthermore, both label arrays need to cover the same points. You should probably treat all remaining points as unclustered and assign each of them its own unique label (as far as I know, sklearn does not support "noise" labels).

So you need to transform your data.
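A sketch of that transformation, using two tiny made-up partitions in your list-of-clusters format (the real code would use `List1` and `List2`, and `all_points` could be every cell of the 90x61 grid instead of just the union):

```python
from sklearn.metrics.cluster import adjusted_rand_score

# Hypothetical small partitions in the question's list-of-clusters format.
list1 = [[41, 42], [145, 146, 206]]
list2 = [[41, 145], [42, 146, 206]]

def to_labels(partition, all_points):
    """Map each point to the index of its cluster; points not in any
    cluster each get their own fresh label (sklearn has no noise label)."""
    labels = {}
    for cluster_id, cluster in enumerate(partition):
        for point in cluster:
            labels[point] = cluster_id
    next_label = len(partition)
    out = []
    for point in all_points:
        if point in labels:
            out.append(labels[point])
        else:
            out.append(next_label)
            next_label += 1
    return out

# A common point set: here, every point appearing in either partition.
points = sorted(set().union(*list1, *list2))

labels1 = to_labels(list1, points)
labels2 = to_labels(list2, points)

print(adjusted_rand_score(labels1, labels2))
```

Note that padding the shorter partition with dummy clusters of arbitrary unused grid cells, as suggested in the question, would inject spurious "agreement" or "disagreement" into the contingency table; comparing both clusterings over one shared point set, with singleton labels for the unclustered points, keeps the comparison honest.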