[Math] Jaccard similarity between columns with only cyclic permutations

elementary-set-theoryprobability

I am reading a text called Mining Massive Datasets by Jure Leskovec. Link to the chapter I'm reading here: MMD

I solved all of the exercises for this chapter except for exercise 3.3.6 which reads as follows:

One might expect that we could estimate the Jaccard similarity
of columns without using all possible permutations of rows. For example,
we could only allow cyclic permutations; i.e., start at a randomly chosen row r, which becomes the first in the order, followed by rows r + 1, r + 2, and so on, down to the last row, and then continuing with the first row, second row, and so on, down to row r − 1. There are only n such permutations if there are n rows. However, these permutations are not sufficient to estimate the Jaccard similarity correctly. Give an example of a two-column matrix where averaging over all the cyclic permutations does not give the Jaccard similarity.

Would appreciate any help in figuring this out.

Best Answer

For the configuration shown below

$$ \begin{array}{c|c|lcr} \text{} & S_1 & S_2\\ \hline 1 & 0 & 0\\ 2 & 0& 1\\ 3 & 1 & 1 \end{array} $$

the Jaccard similarity of $S_1,S_2$ is $1/2$.

Start with the identity permutation and permute cyclically. Thus, let permutations $p_1,p_2,p_3$ be such that

\begin{align*} &p_1(S_1) = [0,0,1]&p_1(S_2) = [0,1,1]\\[6pt] &p_2(S_1) = [0,1,0]&p_2(S_2) = [1,1,0]\\[6pt] &p_3(S_1) = [1,0,0]&p_3(S_2) = [1,0,1] \end{align*}

and let $h_1,h_2,h_3$ be the corresonding hash functions. Then

\begin{align*} &h_1(S_1) = h(p_1(S_1)) = h([0,0,1]) = 3\\ &h_1(S_2) = h(p_1(S_2)) = h([0,1,1]) = 2 \end{align*}

\begin{align*} &h_2(S_1) = h(p_2(S_1)) = h([0,1,0]) = 2\\ &h_2(S_2) = h(p_2(S_2)) = h([1,1,0]) = 1 \end{align*}

\begin{align*} &h_3(S_1) = h(p_3(S_1)) = h([1,0,0]) = 1\\ &h_3(S_2) = h(p_3(S_2)) = h([1,1,0]) = 1 \end{align*}

so $S_1,S_2$ have the same hash value for permutation $p_3$, but not for permutations $p_1,p_2$.

Thus, the average number of hash matches for the $3$ cyclic permutations is $1/3$, not $1/2$.

Related Question