Binary Data – Choosing Jaccard Over Russell and Rao Similarity Coefficients

association-measure, binary-data, similarities

From the Encyclopedia of Statistical Sciences I understand that, given $p$ dichotomous (binary: 1 = present; 0 = absent) attributes (variables), we can form a contingency table for any two objects $i$ and $j$ of a sample:

           j
         1   0
       ---------
   1   | a | b |
 i     ---------
   0   | c | d |
       ---------
a = number of variables on which both objects i and j are 1
b = number of variables where object i is 1 and j is 0
c = number of variables where object i is 0 and j is 1
d = number of variables where both i and j are 0
a+b+c+d = p, the number of variables.

We can calculate from these values similarity coefficients between any pair of objects, specifically the Jaccard coefficient
$$
\frac{a}{a+b+c}
$$
and the Russell and Rao coefficient
$$
\frac{a}{a+b+c+d} = \frac{a}{p}.
$$
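As a quick illustration, here is a minimal sketch in Python/NumPy (with hypothetical 0/1 profiles chosen just for the example) of how the four counts and the two coefficients are obtained for one pair of objects:

```
import numpy as np

x_i = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # object i (hypothetical profile)
x_j = np.array([1, 0, 0, 1, 1, 0, 1, 0])  # object j (hypothetical profile)

a = np.sum((x_i == 1) & (x_j == 1))  # both present
b = np.sum((x_i == 1) & (x_j == 0))  # present in i only
c = np.sum((x_i == 0) & (x_j == 1))  # present in j only
d = np.sum((x_i == 0) & (x_j == 0))  # absent in both
p = a + b + c + d                    # total number of variables

jaccard = a / (a + b + c)            # 3 / 5 = 0.6
russell_rao = a / p                  # 3 / 8 = 0.375
print(a, b, c, d, jaccard, russell_rao)
```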

When calculated, these coefficients give different values, but I can't find any resources that explain why I should choose one over the other. Is it just because, for some datasets, the simultaneous absence of both attributes ($d$) doesn't convey any information?

Best Answer

There exist many such coefficients (most are expressed here). Just try to think through the consequences of the differences in the formulas, especially when you compute a whole matrix of coefficients.

Imagine, for example, that objects 1 and 2 are similar, as are objects 3 and 4. But 1 and 2 have many of the attributes on the list while 3 and 4 have only a few. In this case, Russell-Rao (the proportion of co-attributes to the total number of attributes under consideration) will be high for pair 1-2 and low for pair 3-4. But Jaccard (the proportion of co-attributes to the combined number of attributes the two objects have = the probability that if either object has an attribute, then both have it) will be high for both pairs 1-2 and 3-4.
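A small numeric sketch of that point, with made-up profiles (a "rich" pair 1-2 and a "poor" pair 3-4, each sharing most of what it has):

```
import numpy as np

def counts(x, y):
    a = np.sum((x == 1) & (y == 1))
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    d = np.sum((x == 0) & (y == 0))
    return a, b, c, d

def jaccard(x, y):
    a, b, c, _ = counts(x, y)
    return a / (a + b + c)

def russell_rao(x, y):
    a, b, c, d = counts(x, y)
    return a / (a + b + c + d)

o1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])  # "rich" pair: many attributes
o2 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1])
o3 = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])  # "poor" pair: few attributes
o4 = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

print(russell_rao(o1, o2), russell_rao(o3, o4))  # 0.8 vs 0.4
print(jaccard(o1, o2), jaccard(o3, o4))          # 0.8 vs 0.8
```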

This adjustment for the base level of "saturation by attributes" is what makes Jaccard so popular and more useful than Russell-Rao, e.g. in cluster analysis or multidimensional scaling. You might, in a sense, refine the above adjustment further by selecting the Kulczynski-2 measure, which is the arithmetic mean of the probabilities that if one object has an attribute, the other object has it too: $$ \left(\frac{a}{a+b} + \frac{a}{a+c}\right) /2 $$ Here the base (or field) of attributes is not pooled over the two objects, as in Jaccard, but is each object's own. Consequently, if the objects differ greatly in the number of attributes they have, and the "poorer" object shares all of its attributes with the "richer" one, Kulczynski-2 will be high whereas Jaccard will be only moderate.
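A quick sketch of that contrast, using hypothetical counts in which the "poorer" object shares all of its attributes with the "richer" one:

```
def kulczynski2(a, b, c):
    # arithmetic mean of a/(a+b) and a/(a+c)
    return (a / (a + b) + a / (a + c)) / 2

def jaccard_from_counts(a, b, c):
    return a / (a + b + c)

# poor object: 3 attributes, all shared; rich object: 7 extra attributes
print(kulczynski2(3, 7, 0))         # (0.3 + 1.0) / 2 = 0.65
print(jaccard_from_counts(3, 7, 0)) # 3 / 10 = 0.3
```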

Or you could prefer to compute the geometric mean of the probabilities that if one object has an attribute, the other object has it too, which yields the Ochiai measure: $$ \sqrt {\frac{a}{a+b} \cdot \frac{a}{a+c}} $$ Because a product grows more slowly than a sum when only one of its factors grows, Ochiai will be really high only if both proportions (probabilities) are high, which implies that to be considered similar by Ochiai the objects must share large shares of their attributes. In short, Ochiai curbs similarity when $b$ and $c$ are unequal. Ochiai is in fact the cosine similarity measure (and Russell-Rao is the dot-product similarity).
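And a sketch confirming the cosine equivalence for binary data (the same hypothetical profiles as in the first snippet above):

```
import numpy as np

def ochiai(a, b, c):
    # geometric mean of a/(a+b) and a/(a+c)
    return np.sqrt((a / (a + b)) * (a / (a + c)))

x_i = np.array([1, 1, 0, 1, 0, 0, 1, 0])
x_j = np.array([1, 0, 0, 1, 1, 0, 1, 0])

a = np.sum((x_i == 1) & (x_j == 1))   # 3
b = np.sum((x_i == 1) & (x_j == 0))   # 1
c = np.sum((x_i == 0) & (x_j == 1))   # 1

cosine = (x_i @ x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j))
print(ochiai(a, b, c), cosine)        # both 0.75
```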


P.S.

Is it just because for some datasets, the simultaneous absence of both attributes (d) doesn't convey any information?

Speaking of similarity measures, one shouldn't mix nominal dichotomous attributes (e.g. female vs male) with binary attributes (present vs absent). A binary attribute is not symmetric (in general): if you and I share a characteristic, that is a basis for calling us similar; if you and I both lack the characteristic, it may or may not be considered evidence of similarity, depending on the context of the study. Hence the divergent treatment of $d$ is possible.

Note also that if you wish to compute similarity between objects based on one or more nominal attributes (dichotomous or polytomous), you should recode each such variable into a set of dummy binary variables. The recommended similarity measure to compute is then Dice (which, when computed over such sets of dummy variables, is equivalent to Ochiai and Kulczynski-2).
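For instance, a sketch of the dummy-coding idea with a single hypothetical polytomous attribute ("colour" with categories red/green/blue), taking Dice as $2a/(2a+b+c)$:

```
import numpy as np

categories = ["red", "green", "blue"]

def dummies(value):
    # recode one nominal value into a set of 0/1 dummy variables
    return np.array([int(value == cat) for cat in categories])

x_i = dummies("red")    # [1, 0, 0]
x_j = dummies("green")  # [0, 1, 0]

a = np.sum((x_i == 1) & (x_j == 1))
b = np.sum((x_i == 1) & (x_j == 0))
c = np.sum((x_i == 0) & (x_j == 1))

dice = 2 * a / (2 * a + b + c)
print(dice)             # 0.0: the two objects fall in different categories
```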
