Solved – confusion matrix and Jaccard index computation in O(n) for cluster comparison

clustering, computational-statistics, jaccard-similarity

My problem:
I have two clusterings (partitions) of the set $\{1,\dots,n\}$, each with $k$ clusters (subsets), that I have to compare. To do so I'd like to try several different indices, such as the Jaccard index, the Mirkin metric and the Variation of Information, so I need to compute a confusion matrix. Given that the labels are arbitrary in $1,\dots,k$, a naive implementation to compute a confusion matrix like this

[Image: a $k \times k$ confusion matrix $M$, where $m_{ij}$ counts the elements assigned to cluster $i$ in the first clustering and cluster $j$ in the second.]

would be to compute, for every $i,j \in \{1,\dots,k\}$, the number of elements shared by cluster $i$ and cluster $j$ by scanning the data, but with a scan of all $n$ elements per entry this algorithm is $O(k^2 n)$.

I've read a few papers where the authors say the confusion matrix can be computed in $O(n)$. In Integrating Microarray Data by Consensus Clustering, for example, they choose a symmetric-difference distance that they can compute from the $b$ and $c$ values of a confusion matrix, values that can be computed in $O(n)$ time. How can they do that?

Best Answer

Assuming you have $O(1)$ lookup of cluster memberships, you can iterate over all objects, query both cluster memberships, and tabulate them in a $k \times k$ matrix (counting how often an object falls in cluster $i$ resp. cluster $j$). This part is $O(n)$. From this $k\times k$ matrix, you can then efficiently compute the $2 \times 2$ "pair confusion" matrix. This part is $O(k^2)$.
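The first, $O(n)$ step can be sketched like this (a minimal sketch; the function and variable names are my own, and I assume clusters are labelled $0,\dots,k-1$):

```python
import numpy as np

def contingency_matrix(labels_a, labels_b, k):
    """One pass over the n objects: m[i, j] counts how many objects
    fall in cluster i of the first clustering and cluster j of the second."""
    m = np.zeros((k, k), dtype=np.int64)
    for i, j in zip(labels_a, labels_b):  # O(1) membership lookup per object
        m[i, j] += 1
    return m
```

Each object is visited exactly once, so the tabulation is $O(n)$ regardless of $k$.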

The total algorithm thus is $O(n + k^2)$, and if we assume the $k^2$ term to be negligible because $k^2 \ll n$, then we have $O(n)$.
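The second step can be sketched as follows: every entry of the pair-counting table is a sum of $\binom{\cdot}{2}$ terms over the $k \times k$ matrix and its marginals, hence $O(k^2)$. This is again a sketch with my own names; the Jaccard index then follows as $a/(a+b+c)$:

```python
import numpy as np

def pair_confusion(m):
    """Derive the 2x2 pair-counting table (a, b, c, d) from a
    k x k contingency matrix m in O(k^2)."""
    comb2 = lambda x: x * (x - 1) // 2  # number of pairs among x items
    a = comb2(m).sum()                  # pairs co-clustered in both partitions
    b = comb2(m.sum(axis=1)).sum() - a  # together in the first, apart in the second
    c = comb2(m.sum(axis=0)).sum() - a  # apart in the first, together in the second
    d = comb2(m.sum()) - a - b - c      # apart in both
    return a, b, c, d

m = np.array([[2, 0], [0, 2]])          # two identical 2-cluster partitions of 4 objects
a, b, c, d = pair_confusion(m)
jaccard = a / (a + b + c)               # 1.0 for identical clusterings
```

The same $(a, b, c, d)$ values also give the symmetric-difference distance ($b + c$) and the other pair-counting indices mentioned in the question.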
