Solved – Jaccard similarity in R

jaccard-similarity, r

I want to compare 2 vectors of length 43; they have values of 0 (not present) and 1 (present). I will refer to $M_{1,1}$ as the number of positions in which both vectors have a 1, and to $M_{1,0}$ and $M_{0,1}$ as the numbers of positions in which only one vector has a 1 while the other has a 0.

data3$IDS  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
data3$CESD 1 1 1 0 1 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1

I want to understand how related these 2 vectors are. Reading up on the topic, the Jaccard index seems the way to go. In this specific case, the Jaccard index would be (note that I am using the formula given next to the second figure on Wikipedia):
$$
\frac{M_{1,1}}{M_{1,0} + M_{0,1} - M_{1,1}}
$$
In my case: $8 / (23 + 12 - 8) = 0.2962963$

Using:

library('clusteval')
cluster_similarity(data3$IDS, data3$CESD, similarity="jaccard", method="independence")

Returns:

0.553429

I can't figure out why the two values differ, or where my mistake is.

Another thing I do not understand is what happens in cases of high overlap. Imagine $M_{1,1} = 30$, with only $2$ values each in the cells $M_{1,0}$ and $M_{0,1}$. This would lead to a Jaccard index of $30/(2+2-30) = -1.153846$.

But the Jaccard index is only defined between 0 and 1. Where is my misunderstanding?

Best Answer

Looking at the Wikipedia page's edit history, it seems the problem was due to a confusion about the two types of mathematical notation that are used to represent the index. Using notation from set theory, we have:
$$ J(A,B) = \frac{|A\cap B|}{|A\cup B|} = \frac{|A\cap B|}{|A| + |B| - |A\cap B|} $$ where $\cap$ denotes the intersection, $\cup$ denotes the union, and $\lvert\ \rvert$ denotes the cardinality.
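As a minimal sketch of the set-theoretic formula, the two vectors from the question can be turned into sets of positions and compared with base R. The names `ids` and `cesd` are stand-ins for `data3$IDS` and `data3$CESD`, and the values are retyped from the printout in the question, so treat them as an assumption about the actual data.

```r
# Data retyped from the question (43 values each); `ids` and `cesd` are
# stand-in names for data3$IDS and data3$CESD.
ids  <- c(rep(1, 31), rep(0, 12))
cesd <- c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1,
          rep(0, 13), rep(1, 12))
stopifnot(length(ids) == 43, length(cesd) == 43)

# A and B are the sets of positions at which each vector equals 1.
A <- which(ids == 1)
B <- which(cesd == 1)

# |A intersect B| / |A union B|
jaccard_sets <- length(intersect(A, B)) / length(union(A, B))
jaccard_sets  # 8 / 43, about 0.186
```

With these data every position has a 1 in at least one vector, so the union covers all 43 positions.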

Lower down, the formula was presented algebraically using counts from a matrix / contingency table $M$:
$$ J = \frac{M_{11}}{M_{10}+M_{01}+M_{11}} $$ This seemed contradictory to an editor who commented that there was an "Erro in formula [sic]. Should be minus the intersection".
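The count-based formula can be checked directly on the question's data. This is a sketch in base R, again using `ids` and `cesd` as stand-in names for the two columns, with values retyped from the question.

```r
# Data retyped from the question; stand-ins for data3$IDS and data3$CESD.
ids  <- c(rep(1, 31), rep(0, 12))
cesd <- c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1,
          rep(0, 13), rep(1, 12))

# Cell counts of the 2 x 2 contingency table.
m11 <- sum(ids == 1 & cesd == 1)  # both present:    8
m10 <- sum(ids == 1 & cesd == 0)  # only IDS present: 23
m01 <- sum(ids == 0 & cesd == 1)  # only CESD present: 12

# All three counts are ADDED in the denominator.
jaccard_counts <- m11 / (m10 + m01 + m11)
jaccard_counts  # 8 / 43, about 0.186

# The high-overlap example from the question stays inside [0, 1].
jaccard_high <- 30 / (2 + 2 + 30)
jaccard_high  # 30 / 34, about 0.882
```

Because the denominator is a sum of non-negative counts that includes the numerator, the result cannot be negative or exceed 1.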

The two formulas are in fact consistent because although $|A\cap B|=M_{11}$, $|A|\ne M_{10}$ and $|B|\ne M_{01}$. Rather, $|A| = M_{10}+M_{11}$ and $|B| = M_{01}+M_{11}$, so $|A|+|B|-|A\cap B| = M_{10}+M_{01}+M_{11}$. The algebraic formula could have been presented (in a manner that is more cumbersome, but more clearly parallel to the top formula) like this:
$$ J = \frac{M_{11}}{\sum_j M_{1j} + \sum_i M_{i1} - M_{11}} $$
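The marginal-sum version can be verified numerically with a sketch like the one below. It also tries to account for the `0.553429` in the question. That part rests on an assumption about `clusteval`: `cluster_similarity` appears to treat its two arguments as cluster labelings and to compute Jaccard over pairs of observations that share a cluster, not over the 0/1 values themselves. The helper `pairs_together` is a hypothetical name introduced here, and `ids` and `cesd` are stand-ins for the question's columns.

```r
# Data retyped from the question; stand-ins for data3$IDS and data3$CESD.
ids  <- c(rep(1, 31), rep(0, 12))
cesd <- c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1,
          rep(0, 13), rep(1, 12))

# 2 x 2 contingency table; rows index IDS, columns index CESD.
M <- table(factor(ids, levels = c(0, 1)), factor(cesd, levels = c(0, 1)))

# Marginal-sum form: row sum for IDS = 1 is |A|, column sum for CESD = 1 is |B|.
jaccard_marginals <- M["1", "1"] /
  (sum(M["1", ]) + sum(M[, "1"]) - M["1", "1"])
jaccard_marginals  # 8 / (31 + 20 - 8) = 8 / 43, about 0.186

# Assumed behaviour of cluster_similarity: Jaccard over co-clustered pairs.
# pairs_together() (hypothetical helper) counts pairs sharing a group.
pairs_together <- function(counts) sum(choose(as.vector(counts), 2))

both    <- pairs_together(M)           # together under both labelings: 347
in_ids  <- pairs_together(rowSums(M))  # together under IDS:  531
in_cesd <- pairs_together(colSums(M))  # together under CESD: 443

jaccard_pairs <- both / (in_ids + in_cesd - both)
jaccard_pairs  # 347 / 627, about 0.553429
```

If that assumption holds, the package result and the hand calculation answer different questions: one compares two partitions of the 43 observations, the other compares two sets of "present" positions.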
