Solved – Most well-known set-similarity measures


I know of the Jaccard index and the Sørensen-Dice coefficient for computing set similarity, but have been unable to find any other algorithms related to set similarity. This site contains quite a few resources for vector similarity, but that's not what I want.

What other set-similarity measures exist?

Best Answer

Other measures are:

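Overlap coefficient: $\frac{|A \cap B|}{\min(|A|,|B|)}$
Tversky index: $\frac{|A \cap B|}{|A\cap B| + \alpha|A\setminus B| + \beta|B \setminus A|}$, where $\alpha$ and $\beta$ are non-negative numbers. Setting $\alpha = \beta = 1$ gives the Jaccard index, and $\alpha = \beta = 1/2$ gives the Sørensen-Dice coefficient.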
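As a rough illustration, these measures can be written for Python sets as follows. This is a minimal sketch: the function names are made up for this example, the Tversky index is taken in its usual ratio form $|A\cap B| / (|A\cap B| + \alpha|A\setminus B| + \beta|B\setminus A|)$, and returning 1.0 when a denominator is zero is an assumed convention, not part of the definitions.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0  # assumed convention for two empty sets


def dice(a: set, b: set) -> float:
    """2|A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def overlap(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 1.0


def tversky(a: set, b: set, alpha: float, beta: float) -> float:
    """|A ∩ B| / (|A ∩ B| + alpha*|A \\ B| + beta*|B \\ A|)."""
    shared = len(a & b)
    denom = shared + alpha * len(a - b) + beta * len(b - a)
    return shared / denom if denom else 1.0


A = {1, 2, 3, 4}
B = {3, 4, 5}
print(jaccard(A, B), dice(A, B), overlap(A, B))
print(tversky(A, B, 1.0, 1.0))  # same value as jaccard(A, B)
print(tversky(A, B, 0.5, 0.5))  # same value as dice(A, B)
```

With $\alpha \neq \beta$ the Tversky index is asymmetric, so `tversky(A, B, ...)` and `tversky(B, A, ...)` generally differ.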

Related Solutions

Binary Data – Choosing Jaccard Over Russell and Rao Similarity Coefficients

There exist many such coefficients (most are expressed here). Think through the consequences of the differences between the formulas, especially when you compute a whole matrix of coefficients.

Imagine, for example, that objects 1 and 2 are similar, as are objects 3 and 4. But 1 and 2 have many of the attributes on the list, while 3 and 4 have only a few. In this case, Russell-Rao (the proportion of co-attributes to the total number of attributes under consideration) will be high for pair 1-2 and low for pair 3-4. But Jaccard (the proportion of co-attributes to the combined number of attributes both objects have, i.e. the probability that if either object has an attribute then they both have it) will be high for both pairs 1-2 and 3-4.
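A small numeric sketch of this contrast, using made-up binary vectors over ten attributes. The helper names are hypothetical, and the counts follow the usual 2x2 convention: $a$ = attributes both objects have, $b$ and $c$ = attributes only one of them has, $d$ = attributes neither has.

```python
def counts(x, y):
    """Return (a, b, c, d) for two equal-length 0/1 vectors."""
    a = sum(1 for i, j in zip(x, y) if i and j)          # present in both
    b = sum(1 for i, j in zip(x, y) if i and not j)      # only in x
    c = sum(1 for i, j in zip(x, y) if not i and j)      # only in y
    d = sum(1 for i, j in zip(x, y) if not i and not j)  # absent in both
    return a, b, c, d


def russell_rao(x, y):
    a, b, c, d = counts(x, y)
    return a / (a + b + c + d)


def jaccard(x, y):
    a, b, c, _ = counts(x, y)
    return a / (a + b + c)  # undefined if neither object has any attribute


# Pair 1-2: attribute-rich objects (illustrative data)
obj1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
obj2 = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0]
# Pair 3-4: attribute-poor objects (illustrative data)
obj3 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
obj4 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

print(russell_rao(obj1, obj2), jaccard(obj1, obj2))  # 0.6 0.75
print(russell_rao(obj3, obj4), jaccard(obj3, obj4))  # 0.3 0.75
```

Both pairs get the same Jaccard value, while Russell-Rao halves for the sparser pair because joint absences ($d$) sit in its denominator.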

This adjustment for the base level of "saturation by attributes" is what makes Jaccard so popular, and more useful than Russell-Rao, e.g. in cluster analysis or multidimensional scaling. You might, in a sense, further refine the above adjustment by selecting the Kulczynski-2 measure, which is the arithmetic mean probability that if one object has an attribute, the other object has it too: $$ (\frac{a}{a+b} + \frac{a}{a+c}) /2 $$ Here $a$ is the number of attributes both objects have, and $b$ and $c$ are the numbers of attributes that only the first or only the second object has. The base (or field) of attributes for the two objects is not pooled, as in Jaccard; each object has its own base. Consequently, if the objects differ greatly in the number of attributes they have, and the "poorer" object shares all of its attributes with the "richer" one, Kulczynski will be high whereas Jaccard will be moderate.
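To make the nested case concrete, here is a minimal sketch working directly from the counts, with $a$ the number of shared attributes and $b$, $c$ the numbers unique to each object. The function names and the example counts are invented for illustration.

```python
def kulczynski2(a: int, b: int, c: int) -> float:
    """Arithmetic mean of a/(a+b) and a/(a+c)."""
    return (a / (a + b) + a / (a + c)) / 2


def jaccard_from_counts(a: int, b: int, c: int) -> float:
    """a / (a + b + c): shared attributes over the pooled attribute base."""
    return a / (a + b + c)


# "Rich" object has 10 attributes, "poor" object has 4, all of which it shares
# with the rich one: a = 4, b = 6 (only in rich), c = 0 (only in poor).
a, b, c = 4, 6, 0
print(kulczynski2(a, b, c))          # mean of 0.4 and 1.0
print(jaccard_from_counts(a, b, c))  # 4 / 10
```

The poorer object contributes a conditional proportion of 1.0, which pulls Kulczynski-2 up to 0.7 while Jaccard stays at 0.4.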

Or you could prefer to compute the geometric mean probability that if one object has an attribute, the other object has it too, which yields the Ochiai measure: $$ \sqrt {\frac{a}{a+b} \frac{a}{a+c}} $$ Because a product grows more slowly than a sum when only one of its terms increases, Ochiai will be really high only if both of the two proportions (probabilities) are high, which implies that to be considered similar by Ochiai the objects must share a large part of their attributes. In short, Ochiai curbs similarity if $b$ and $c$ are unequal. Ochiai is in fact the cosine similarity measure (and Russell-Rao is the dot product similarity).
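The link between Ochiai and cosine similarity can be checked numerically on binary vectors. The sketch below uses made-up data and hypothetical function names; Russell-Rao is computed as the dot product divided by the number of attributes.

```python
from math import sqrt


def ochiai(x, y):
    """Geometric mean of a/(a+b) and a/(a+c) for 0/1 vectors."""
    a = sum(i * j for i, j in zip(x, y))  # co-present attributes
    n_x, n_y = sum(x), sum(y)             # a + b and a + c
    return sqrt((a / n_x) * (a / n_y))


def cosine(x, y):
    """Standard cosine similarity: dot product over product of norms."""
    dot = sum(i * j for i, j in zip(x, y))
    return dot / (sqrt(sum(i * i for i in x)) * sqrt(sum(j * j for j in y)))


def russell_rao(x, y):
    """Dot product scaled by the total number of attributes."""
    return sum(i * j for i, j in zip(x, y)) / len(x)


x = [1, 1, 1, 1, 0, 0]  # illustrative data
y = [1, 1, 0, 0, 1, 0]
print(ochiai(x, y), cosine(x, y))  # the two values agree up to rounding
print(russell_rao(x, y))
```

For 0/1 data the squared norm of a vector equals its number of ones, which is why the two formulas coincide.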

P.S.

Is it just because for some datasets, the simultaneous absence of both attributes (d) doesn't convey any information?

Speaking of similarity measures, one shouldn't mix nominal dichotomous attributes (e.g. female, male) with binary attributes (present vs absent). A binary attribute isn't symmetric (in general): if you and I share a characteristic, that is a basis for calling us similar; if you and I both lack the characteristic, it may or may not be considered evidence of similarity, depending on the context of the study. Hence the divergent treatment of $d$ is possible.

Note also that if you wish to compute similarity between objects based on one or more nominal attributes (dichotomous or polytomous), recode each such variable into a set of dummy binary variables. Then the recommended similarity measure to compute is Dice (which, when computed for one or more sets of dummy variables, is equivalent to Ochiai and Kulczynski-2).
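A sketch of why the three measures agree after dummy coding: every object has exactly one "1" per nominal variable, so $b = c$. The records, variable names, and helper functions below are invented for illustration; each object is represented by the set of its (variable, category) pairs, which is the same as listing the dummy columns equal to 1.

```python
from math import sqrt


def dummy_set(record: dict) -> set:
    """Set of (variable, category) pairs, i.e. the dummy columns that equal 1."""
    return set(record.items())


def abc(x: set, y: set):
    """Counts: shared ones, ones only in x, ones only in y."""
    return len(x & y), len(x - y), len(y - x)


def dice(a, b, c):
    return 2 * a / (2 * a + b + c)


def ochiai(a, b, c):
    return a / sqrt((a + b) * (a + c))


def kulczynski2(a, b, c):
    return (a / (a + b) + a / (a + c)) / 2


obj1 = {"colour": "red", "shape": "round", "size": "medium"}
obj2 = {"colour": "red", "shape": "square", "size": "medium"}

a, b, c = abc(dummy_set(obj1), dummy_set(obj2))
print(a, b, c)                                              # 2 1 1
print(dice(a, b, c), ochiai(a, b, c), kulczynski2(a, b, c)) # all equal 2/3 up to rounding
```

In this setting the common value is simply the fraction of nominal variables on which the two objects match.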

Similarity Measures – How to Compare More Than 2 Variables Using Jaccard Similarity and Other Methods

This answer will draw heavily on the ecological literature, where Jaccard and other (dis)similarity measures are commonly used to quantify the compositional (dis)similarity between species assemblages at different sites. The single best reference is Baselga (2013), "Multiple site dissimilarity quantifies compositional heterogeneity among several sites, while average pairwise dissimilarity may be misleading", which is freely available here.

Basically, there are several approaches to quantifying higher-order dissimilarities (higher-order than pairwise). One is to average the pairwise dissimilarities for all pairs in the sample. This metric generally performs poorly for a variety of reasons, detailed in Baselga (2013). Another possibility is to find the average distance from an observation to the multivariate centroid.
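For reference, the first approach (averaging over all pairs) might be sketched as below. It assumes each site is a Python set of species and uses Jaccard dissimilarity (one minus the Jaccard index) as the pairwise measure; both are choices made for this example, and the site data are invented.

```python
from itertools import combinations


def jaccard_dissimilarity(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sites are treated as identical (assumption)."""
    union = len(a | b)
    return 1 - len(a & b) / union if union else 0.0


def mean_pairwise_dissimilarity(sites) -> float:
    """Average the pairwise dissimilarity over all unordered pairs of sites."""
    pairs = list(combinations(sites, 2))
    return sum(jaccard_dissimilarity(a, b) for a, b in pairs) / len(pairs)


sites = [{1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6}]  # illustrative species sets
print(mean_pairwise_dissimilarity(sites))
```

The average only ever looks at two sites at a time, so it carries no information about species shared by three or more sites.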

There is an explicit generalization of the Sorensen index to more than two observations. Recall that the Sorensen index is $\frac{2ab}{a+b}$, where $a$ is the number of species (ones in your case) in sample A, $b$ is the number of species in sample B, and $ab$ is the number of species shared by samples A and B (i.e. the dot product of the two presence/absence vectors, not the product $a \times b$). The three-site generalization, formulated by Diserud and Odegaard (2007) and discussed by Chao et al (2012), is $\frac{3}{2}\frac{ab+ac+bc-abc}{a+b+c}$, where $abc$ is the number of species shared by all three samples. Consult Diserud and Odegaard (2007) for the motivation behind this metric as well as extensions to $N>3$. The references in Baselga (2013) will also point you to a multi-site generalization of the Simpson index, as well as R packages to compute the multi-site Sorensen and Simpson metrics.
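The two- and three-site formulas translate directly into set operations. The following is only a sketch with hypothetical function names and invented species sets; it covers the three-site case as written above and does not attempt the general $N$-site extension.

```python
def sorensen2(A: set, B: set) -> float:
    """Two-site Sorensen index: 2*ab / (a + b), with ab = number of shared species."""
    return 2 * len(A & B) / (len(A) + len(B))


def sorensen3(A: set, B: set, C: set) -> float:
    """Three-site index: (3/2) * (ab + ac + bc - abc) / (a + b + c)."""
    ab, ac, bc = len(A & B), len(A & C), len(B & C)
    abc = len(A & B & C)
    return 1.5 * (ab + ac + bc - abc) / (len(A) + len(B) + len(C))


A = {1, 2, 3, 4}
B = {3, 4, 5}
C = {4, 5, 6}
print(sorensen2(A, B))
print(sorensen3(A, B, C))  # ab=2, ac=1, bc=2, abc=1 -> 1.5 * 4 / 10
print(sorensen3(A, A, A))  # identical sites give 1
```

Three pairwise-disjoint sites give 0 and three identical sites give 1, which is the role of the $\frac{3}{2}$ factor.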

Some researchers have also found it useful to examine the average number of species shared by $i$ sites, where $i$ ranges from $2$ to $N$. This reveals some interesting scaling properties and unites a variety of concepts for different values of $i$. The key reference here is Hui and McGeoch (2014) available for free here. This paper also has an associated R package called 'zetadiv'.
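The quantity described here, the average number of species shared by $i$ sites, can be computed by brute force over all combinations of $i$ sites. This sketch is not the 'zetadiv' API; the function name and data are invented, and enumerating every combination is only practical for a small number of sites.

```python
from itertools import combinations


def mean_shared(sites, i: int) -> float:
    """Average number of species common to all sites in each i-site combination."""
    combos = list(combinations(sites, i))
    return sum(len(set.intersection(*combo)) for combo in combos) / len(combos)


sites = [{1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6}]  # illustrative species sets
for i in range(1, len(sites) + 1):
    print(i, mean_shared(sites, i))
```

For $i = 1$ this reduces to the mean species richness per site, and the values cannot increase as $i$ grows, since adding a site to a combination can only shrink the intersection.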
