Solved – Calculating Jaccard or other association coefficient for binary data using matrix multiplication

association-measurebinary datamatrixrsimilarities

I want to know if is there any possible way to calculate Jaccard coefficient using matrix multiplication.

I used this code

    jaccard_sim <- function(x) {
    # initialize similarity matrix
    m <- matrix(NA, nrow=ncol(x),ncol=ncol(x),dimnames=list(colnames(x),colnames(x)))
    jaccard <- as.data.frame(m)

    for(i in 1:ncol(x)) {
     for(j in i:ncol(x)) {
        jaccard[i,j]= length(which(x[,i] & x[,j])) / length(which(x[,i] | x[,j]))
        jaccard[j,i]=jaccard[i,j]        
       }
     }

This is quite ok to implement in R. I have done the dice similarity one, but got stuck with Tanimoto/Jaccard. Anybody can help?

Best Answer

We know that Jaccard (computed between any two columns of binary data $\bf{X}$) is $\frac{a}{a+b+c}$, while Rogers-Tanimoto is $\frac{a+d}{a+d+2(b+c)}$, where

  • a - number of rows where both columns are 1
  • b - number of rows where this and not the other column is 1
  • c - number of rows where the other and not this column is 1
  • d - number of rows where both columns are 0

$a+b+c+d=n$, the number of rows in $\bf{X}$

Then we have:

$\bf X'X=A$ is the square symmetric matrix of $a$ between all columns.

$\bf (not X)'(not X)=D$ is the square symmetric matrix of $d$ between all columns ("not X" is converting 1->0 and 0->1 in X).

So, $\frac{\bf A}{n-\bf D}$ is the square symmetric matrix of Jaccard between all columns.

$\frac{\bf A+D}{\bf A+D+2(n-(A+D))}=\frac{\bf A+D}{2n-\bf A-D}$ is the square symmetric matrix of Rogers-Tanimoto between all columns.

I checked numerically if these formulas give correct result. They do.


Upd. You can also obtain matrices $\bf B$ and $\bf C$:

$\bf B= [1]'X-A$, where "[1]" denotes matrix of ones, sized as $\bf X$. $\bf B$ is the square asymmetric matrix of $b$ between all columns; its element ij is the number of rows in $\bf X$ with 0 in column i and 1 in column j.

Consequently, $\bf C=B'$.

Matrix $\bf D$ can be also computed this way, of course: $n \bf -A-B-C$.

Knowing matrices $\bf A, B, C, D$, you are able to calculate a matrix of any pairwise (dis)similarity coefficient invented for binary data.