Binary Data – Choosing Jaccard Over Russell and Rao Similarity Coefficients

association-measure, binary-data, similarities

From the Encyclopedia of Statistical Sciences I understand that, given $p$ dichotomous (binary: 1 = present; 0 = absent) attributes (variables), we can form a contingency table for any two objects $i$ and $j$ of a sample:

           j
         1   0
       ---------
   1   | a | b |
 i     ---------
   0   | c | d |
       ---------
a = number of variables on which both objects i and j are 1
b = number of variables where object i is 1 and j is 0
c = number of variables where object i is 0 and j is 1
d = number of variables where both i and j are 0
a+b+c+d = p, the number of variables.

We can calculate from these values similarity coefficients between any pair of objects, specifically the Jaccard coefficient
$$
\frac{a}{a+b+c}
$$
and the Russell and Rao coefficient
$$
\frac{a}{a+b+c+d} = \frac{a}{p}.
$$
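As a quick illustration, here is a minimal sketch in Python/NumPy (with hypothetical 0/1 profiles chosen just for the example) of how the four counts and the two coefficients are obtained for one pair of objects:

```
import numpy as np

x_i = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # object i (hypothetical profile)
x_j = np.array([1, 0, 0, 1, 1, 0, 1, 0])  # object j (hypothetical profile)

a = np.sum((x_i == 1) & (x_j == 1))  # both present
b = np.sum((x_i == 1) & (x_j == 0))  # present in i only
c = np.sum((x_i == 0) & (x_j == 1))  # present in j only
d = np.sum((x_i == 0) & (x_j == 0))  # absent in both
p = a + b + c + d                    # total number of variables

jaccard = a / (a + b + c)            # 3 / 5 = 0.6
russell_rao = a / p                  # 3 / 8 = 0.375
print(a, b, c, d, jaccard, russell_rao)
```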

When calculated, these coefficients give different values, but I can't find any resources that explain why I should choose one over the other. Is it just because, for some datasets, the simultaneous absence of both attributes ($d$) doesn't convey any information?

Best Answer

There exist many such coefficients (most are expressed here). Just try to think through the consequences of the differences in the formulas, especially when you compute a whole matrix of coefficients.

Imagine, for example, that objects 1 and 2 are similar, as are objects 3 and 4. But 1 and 2 have many of the attributes on the list while 3 and 4 have only a few. In this case, Russell-Rao (the proportion of co-attributes to the total number of attributes under consideration) will be high for pair 1-2 and low for pair 3-4. But Jaccard (the proportion of co-attributes to the combined number of attributes the two objects have = the probability that if either object has an attribute, then both have it) will be high for both pairs 1-2 and 3-4.
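A small numeric sketch of that point, with made-up profiles (a "rich" pair 1-2 and a "poor" pair 3-4, each sharing most of what it has):

```
import numpy as np

def counts(x, y):
    a = np.sum((x == 1) & (y == 1))
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    d = np.sum((x == 0) & (y == 0))
    return a, b, c, d

def jaccard(x, y):
    a, b, c, _ = counts(x, y)
    return a / (a + b + c)

def russell_rao(x, y):
    a, b, c, d = counts(x, y)
    return a / (a + b + c + d)

o1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])  # "rich" pair: many attributes
o2 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1])
o3 = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])  # "poor" pair: few attributes
o4 = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

print(russell_rao(o1, o2), russell_rao(o3, o4))  # 0.8 vs 0.4
print(jaccard(o1, o2), jaccard(o3, o4))          # 0.8 vs 0.8
```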

This adjustment for the base level of "saturation by attributes" is what makes Jaccard so popular and more useful than Russell-Rao, e.g. in cluster analysis or multidimensional scaling. You might, in a sense, refine the above adjustment further by selecting the Kulczynski-2 measure, which is the arithmetic mean of the probabilities that if one object has an attribute, the other object has it too: $$ \left(\frac{a}{a+b} + \frac{a}{a+c}\right) /2 $$ Here the base (or field) of attributes is not pooled over the two objects, as in Jaccard, but is each object's own. Consequently, if the objects differ greatly in the number of attributes they have, and the "poorer" object shares all of its attributes with the "richer" one, Kulczynski-2 will be high whereas Jaccard will be only moderate.
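A quick sketch of that contrast, using hypothetical counts in which the "poorer" object shares all of its attributes with the "richer" one:

```
def kulczynski2(a, b, c):
    # arithmetic mean of a/(a+b) and a/(a+c)
    return (a / (a + b) + a / (a + c)) / 2

def jaccard_from_counts(a, b, c):
    return a / (a + b + c)

# poor object: 3 attributes, all shared; rich object: 7 extra attributes
print(kulczynski2(3, 7, 0))         # (0.3 + 1.0) / 2 = 0.65
print(jaccard_from_counts(3, 7, 0)) # 3 / 10 = 0.3
```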

Or you could prefer to compute the geometric mean of the probabilities that if one object has an attribute, the other object has it too, which yields the Ochiai measure: $$ \sqrt {\frac{a}{a+b} \cdot \frac{a}{a+c}} $$ Because a product grows more slowly than a sum when only one of its factors grows, Ochiai will be really high only if both proportions (probabilities) are high, which implies that to be considered similar by Ochiai the objects must share large shares of their attributes. In short, Ochiai curbs similarity when $b$ and $c$ are unequal. Ochiai is in fact the cosine similarity measure (and Russell-Rao is the dot-product similarity).
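And a sketch confirming the cosine equivalence for binary data (the same hypothetical profiles as in the first snippet above):

```
import numpy as np

def ochiai(a, b, c):
    # geometric mean of a/(a+b) and a/(a+c)
    return np.sqrt((a / (a + b)) * (a / (a + c)))

x_i = np.array([1, 1, 0, 1, 0, 0, 1, 0])
x_j = np.array([1, 0, 0, 1, 1, 0, 1, 0])

a = np.sum((x_i == 1) & (x_j == 1))   # 3
b = np.sum((x_i == 1) & (x_j == 0))   # 1
c = np.sum((x_i == 0) & (x_j == 1))   # 1

cosine = (x_i @ x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j))
print(ochiai(a, b, c), cosine)        # both 0.75
```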


P.S.

Is it just because for some datasets, the simultaneous absence of both attributes (d) doesn't convey any information?

Speaking of similarity measures, one shouldn't mix nominal dichotomous attributes (e.g. female vs male) with binary attributes (present vs absent). A binary attribute is not symmetric (in general): if you and I share a characteristic, that is a basis for calling us similar; if you and I both lack the characteristic, it may or may not be considered evidence of similarity, depending on the context of the study. Hence the divergent treatment of $d$ is possible.

Note also that if you wish to compute similarity between objects based on one or more nominal attributes (dichotomous or polytomous), you should recode each such variable into a set of dummy binary variables. The recommended similarity measure to compute is then Dice (which, when computed over such sets of dummy variables, is equivalent to Ochiai and Kulczynski-2).
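For instance, a sketch of the dummy-coding idea with a single hypothetical polytomous attribute ("colour" with categories red/green/blue), taking Dice as $2a/(2a+b+c)$:

```
import numpy as np

categories = ["red", "green", "blue"]

def dummies(value):
    # recode one nominal value into a set of 0/1 dummy variables
    return np.array([int(value == cat) for cat in categories])

x_i = dummies("red")    # [1, 0, 0]
x_j = dummies("green")  # [0, 1, 0]

a = np.sum((x_i == 1) & (x_j == 1))
b = np.sum((x_i == 1) & (x_j == 0))
c = np.sum((x_i == 0) & (x_j == 1))

dice = 2 * a / (2 * a + b + c)
print(dice)             # 0.0: the two objects fall in different categories
```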
