I am reading a text called Mining Massive Datasets by Jure Leskovec. Link to the chapter I'm reading here: MMD
I solved all of the exercises for this chapter except for exercise 3.3.6 which reads as follows:
One might expect that we could estimate the Jaccard similarity
of columns without using all possible permutations of rows. For example,
we could only allow cyclic permutations; i.e., start at a randomly chosen row r, which becomes the first in the order, followed by rows r + 1, r + 2, and so on, down to the last row, and then continuing with the first row, second row, and so on, down to row r − 1. There are only n such permutations if there are n rows. However, these permutations are not sufficient to estimate the Jaccard similarity correctly. Give an example of a two-column matrix where averaging over all the cyclic permutations does not give the Jaccard similarity.
Would appreciate any help in figuring this out.
Best Answer
For the configuration shown below
$$ \begin{array}{c|c|lcr} \text{} & S_1 & S_2\\ \hline 1 & 0 & 0\\ 2 & 0& 1\\ 3 & 1 & 1 \end{array} $$
the Jaccard similarity of $S_1,S_2$ is $1/2$.
Start with the identity permutation and permute cyclically. Thus, let permutations $p_1,p_2,p_3$ be such that
\begin{align*} &p_1(S_1) = [0,0,1]&p_1(S_2) = [0,1,1]\\[6pt] &p_2(S_1) = [0,1,0]&p_2(S_2) = [1,1,0]\\[6pt] &p_3(S_1) = [1,0,0]&p_3(S_2) = [1,0,1] \end{align*}
and let $h_1,h_2,h_3$ be the corresponding hash functions (each returning the position of the first $1$ in the permuted column). Then
\begin{align*} &h_1(S_1) = h(p_1(S_1)) = h([0,0,1]) = 3\\ &h_1(S_2) = h(p_1(S_2)) = h([0,1,1]) = 2 \end{align*}
\begin{align*} &h_2(S_1) = h(p_2(S_1)) = h([0,1,0]) = 2\\ &h_2(S_2) = h(p_2(S_2)) = h([1,1,0]) = 1 \end{align*}
\begin{align*} &h_3(S_1) = h(p_3(S_1)) = h([1,0,0]) = 1\\ &h_3(S_2) = h(p_3(S_2)) = h([1,0,1]) = 1 \end{align*}
so $S_1,S_2$ have the same hash value for permutation $p_3$, but not for permutations $p_1,p_2$.
Thus, the fraction of the $3$ cyclic permutations on which the hash values of $S_1$ and $S_2$ agree is $1/3$, not the Jaccard similarity $1/2$.
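If it helps, here is a short Python sketch (variable names are my own) that verifies the counterexample: it computes the Jaccard similarity of the two columns and the fraction of cyclic permutations on which their minhash values agree.

```python
# Characteristic matrix columns from the answer above (rows 1..3).
S1 = [0, 0, 1]
S2 = [0, 1, 1]
n = len(S1)

# Jaccard similarity: |intersection| / |union| over the rows.
inter = sum(a and b for a, b in zip(S1, S2))
union = sum(a or b for a, b in zip(S1, S2))
jaccard = inter / union  # 1/2

def minhash(col, r):
    # Minhash under the cyclic permutation starting at row r (0-based):
    # the 1-based position of the first 1 in the reordered column.
    reordered = col[r:] + col[:r]
    return reordered.index(1) + 1

# Count the cyclic permutations on which the two columns hash equally.
matches = sum(minhash(S1, r) == minhash(S2, r) for r in range(n))
fraction = matches / n  # 1/3

print(jaccard, fraction)  # 0.5 vs. 0.333... — they disagree
```

Running this confirms that only the permutation starting at the last row produces matching hash values, so the cyclic-permutation estimate is $1/3$ while the true Jaccard similarity is $1/2$.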