Similarity Measures – How to Compare More Than 2 Variables Using Jaccard Similarity and Other Methods

Tags: binary-data, distance, jaccard-similarity, r, similarities

If I have two binary variables, I can determine the similarity of these variables quite easily with different similarity measures, e.g. with the Jaccard similarity measure:

$J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$

Example in R:

# Example data
N <- 1000
x1 <- rbinom(N, 1, 0.5)
x2 <- rbinom(N, 1, 0.5)

# Jaccard similarity measure
a <- sum(x1 == 1 & x2 == 1)
b <- sum(x1 == 1 & x2 == 0)
c <- sum(x1 == 0 & x2 == 1)

jacc <- a / (a + b + c)
jacc

However, I have a group of 1,000 binary variables and want to determine the similarity of the whole group.

Question: What is the best way to determine the similarity of more than 2 binary variables?

One idea is to compute the similarity for every pairwise combination and then summarize the pairwise values with a single number (the example below uses the median):

# Example data
N <- 1000 # Observations
N_vec <- 1000 # Number of vectors
x <- rbinom(N * N_vec, 1, 0.5)
mat_x <- matrix(x, ncol = N_vec)
list_x <- split(mat_x, rep(1:ncol(mat_x), each = nrow(mat_x)))

# Function for calculation of Jaccard similarity
fun_jacc <- function(v1, v2) {

  a <- sum(v1 == 1 & v2 == 1)
  b <- sum(v1 == 1 & v2 == 0)
  c <- sum(v1 == 0 & v2 == 1)

  jacc <- a / (a + b + c)
  return(jacc)
}

# Apply function to all combinations (takes approx. 1 min to calculate)
mat_jacc <- sapply(list_x, function(x) sapply(list_x, function(y) fun_jacc(x,y)))
mat_jacc[upper.tri(mat_jacc)] <- NA
diag(mat_jacc) <- NA
vec_jacc <- as.vector(mat_jacc)
vec_jacc <- vec_jacc[!is.na(vec_jacc)]
median(vec_jacc)

However, this is very inefficient, and I am not sure whether it is theoretically the best way to measure the similarity of such a group of variables.
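The pairwise step itself can probably be made much faster with matrix algebra. The following is a minimal sketch, assuming a binary matrix with one column per vector as in the example above (smaller dimensions are used here so it runs instantly; `inter` and `n_union` are names introduced for illustration). It only addresses speed, not the question of whether a pairwise summary is the right measure.

```r
set.seed(1)
N <- 200      # Observations
N_vec <- 50   # Number of vectors
mat_x <- matrix(rbinom(N * N_vec, 1, 0.5), ncol = N_vec)

# inter[i, j] = number of rows where columns i and j are both 1 (the "a" count)
inter <- crossprod(mat_x)

# Number of ones per column
ones <- colSums(mat_x)

# n_union[i, j] = rows where at least one of the two columns is 1 (a + b + c)
n_union <- outer(ones, ones, "+") - inter

# Jaccard similarity for every pair of columns at once
mat_jacc <- inter / n_union

# Keep each pair once (strict lower triangle) and summarize
vec_jacc <- mat_jacc[lower.tri(mat_jacc)]
median(vec_jacc)
```

Because `crossprod()` computes all pairwise intersection counts in a single matrix product, no nested `sapply()` calls are needed.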

UPDATE: Following user43849's suggestion, the dissimilarity can be calculated with Sorensen's multiple-site dissimilarity:

# Example data
N <- 1000 # Observations
N_vec <- 1000 # Number of vectors
x <- rbinom(N * N_vec, 1, 0.5)
mat_x <- matrix(x, ncol = N_vec)

# Multiple-site dissimilarity according to Sorensen
library("betapart")
beta.multi(t(mat_x), index.family = "sor")$beta.SOR # Vectors are not similar --> almost 1

Best Answer

This answer will draw heavily on the ecological literature, where Jaccard and other (dis)similarity measures are commonly used to quantify the compositional (dis)similarity between species assemblages at different sites. The single best reference is Baselga (2013), "Multiple site dissimilarity quantifies compositional heterogeneity among several sites, while average pairwise dissimilarity may be misleading", which is freely available.

Basically, there are several approaches to quantifying higher-order dissimilarities (higher-order than pairwise). One is to average the pairwise dissimilarities for all pairs in the sample. This metric generally performs poorly for a variety of reasons, detailed in Baselga (2013). Another possibility is to find the average distance from an observation to the multivariate centroid.
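As a rough illustration of the centroid idea, here is a minimal sketch in base R. It assumes plain Euclidean distance on the raw 0/1 values and treats each column as one observation; both choices are assumptions made for this example, and ecological analyses often compute the distances in an ordination space derived from a dissimilarity measure instead.

```r
set.seed(1)
# Rows = species (observations in the question), columns = sites (variables)
mat_x <- matrix(rbinom(200 * 50, 1, 0.5), ncol = 50)

# Centroid: the mean profile across all columns
centroid <- rowMeans(mat_x)

# Euclidean distance from each column to the centroid
# (centroid has one entry per row, so it is recycled down each column)
dist_to_centroid <- sqrt(colSums((mat_x - centroid)^2))

# Average distance to the centroid; larger values mean a more dispersed group
mean(dist_to_centroid)
```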

There is an explicit generalization of the Sorensen index to more than two observations. Recall that the Sorensen index is $\frac{2ab}{a+b}$, where $a$ is the number of species (ones in your case) in sample A, $b$ is the number of species in sample B, and $ab$ is the number of species shared by samples A and B (i.e. the dot product of the two binary vectors, not the product of $a$ and $b$). The three-site generalization, formulated by Diserud and Odegaard (2007) and discussed by Chao et al. (2012), is $\frac{3}{2}\frac{ab+ac+bc-abc}{a+b+c}$, where $abc$ is the number of species shared by all three samples. Consult Diserud and Odegaard (2007) for the motivation behind this metric as well as extensions to $N>3$. The references in Baselga (2013) will also point you to a multi-site generalization of the Simpson index, as well as R packages to compute the multi-site Sorensen and Simpson metrics.
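The three-site formula can be checked with a small base R sketch. By inclusion-exclusion, $ab+ac+bc-abc$ equals $(a+b+c)$ minus the number of species present in at least one sample, so the index can be computed without enumerating intersections. The helper `multi_sorensen_sim()` below is a hypothetical name, and its form for general $N$, $\frac{N}{N-1}\left(1-\frac{S}{\sum_i a_i}\right)$ with $S$ the number of species in the union, is my reading of the extension and should be checked against Diserud and Odegaard (2007).

```r
# Hypothetical helper: multi-site Sorensen similarity via inclusion-exclusion.
# mat: binary matrix, rows = species (observations), columns = samples (variables)
multi_sorensen_sim <- function(mat) {
  n_sets <- ncol(mat)
  total_ones <- sum(mat)             # a + b + c + ...
  n_union <- sum(rowSums(mat) > 0)   # species present in at least one sample
  n_sets / (n_sets - 1) * (1 - n_union / total_ones)
}

# Three small samples
v1 <- c(1, 1, 1, 0, 0)
v2 <- c(1, 1, 0, 1, 0)
v3 <- c(1, 0, 0, 1, 1)

# Explicit three-site formula from the text
n_a <- sum(v1); n_b <- sum(v2); n_c <- sum(v3)
n_ab <- sum(v1 * v2); n_ac <- sum(v1 * v3); n_bc <- sum(v2 * v3)
n_abc <- sum(v1 * v2 * v3)
explicit <- 3 / 2 * (n_ab + n_ac + n_bc - n_abc) / (n_a + n_b + n_c)

# Same value via the union shortcut
shortcut <- multi_sorensen_sim(cbind(v1, v2, v3))

all.equal(explicit, shortcut)
```

The shortcut needs only row and column sums, so it scales to 1,000 vectors without any combinatorial blow-up.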

Some researchers have also found it useful to examine the average number of species shared by $i$ sites, where $i$ ranges from $2$ to $N$. This reveals some interesting scaling properties and unites a variety of concepts for different values of $i$. The key reference here is Hui and McGeoch (2014), which is freely available. This paper also has an associated R package called 'zetadiv'.
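For binary data, the average number of species shared by $i$ sites can be computed exactly without enumerating site combinations: a species that occurs in $k$ of the $N$ sites is shared by $\binom{k}{i}$ of the $\binom{N}{i}$ possible combinations. The sketch below uses only base R; `zeta_mean_shared()` is a hypothetical helper written for this illustration and is not part of the 'zetadiv' API.

```r
# Hypothetical helper: mean number of species shared by i sites, for each i in orders.
# mat: binary matrix, rows = species (observations), columns = sites (variables)
zeta_mean_shared <- function(mat, orders) {
  n_sites <- ncol(mat)
  occ <- rowSums(mat)   # number of sites in which each species occurs
  sapply(orders, function(i) sum(choose(occ, i)) / choose(n_sites, i))
}

# Three small sites with five species
v1 <- c(1, 1, 1, 0, 0)
v2 <- c(1, 1, 0, 1, 0)
v3 <- c(1, 0, 0, 1, 1)
mat <- cbind(v1, v2, v3)

zeta <- zeta_mean_shared(mat, orders = 1:3)
zeta
```

Here the first value is the mean number of species per site, the second is the mean number shared by a pair of sites, and the third is the number shared by all three sites.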
