[Math] How Jaccard similarity can be approximated with minhash similarity

algorithms, computer science, data mining, hash function

On page 81 of the book Mining of Massive Data Sets, it says the following:

Now, consider the probability that $h(S_1) = h(S_2)$. If we imagine the rows
permuted randomly, and we proceed from the top, the probability that we shall
meet a type X row before we meet a type Y row is $\frac{x}{x + y}$. But if the
first row from the top other than type Z rows is a type X row, then surely
$h(S_1) = h(S_2)$. On the other hand, if the first row other than a type Z row
that we meet is a type Y row, then the set with a 1 gets that row as its minhash
value. However, the set with a 0 in that row surely gets some row further down
the permuted list. Thus, we know $h(S_1) \neq h(S_2)$ if we first meet a type Y row.
We conclude the probability that $h(S_1) = h(S_2)$ is $\frac{x}{x + y}$, which is also the
Jaccard similarity of $S_1$ and $S_2$.

Could someone clarify why the probability that $h(S_1) = h(S_2)$ is $\frac{x}{x + y}$? The second sentence of this paragraph doesn't make sense to me.

Best Answer

It is not that $h(S_1)=h(S_2)$ is true in the example; in fact $h(S_1)=a$ and $h(S_2)=c$, as explained in the example on page 80, where $h$ is the minhash function associated with the permutation $\{abcde\}\mapsto \{beadc\}$ and $S_1$ resp. $S_2$ are given in figure 3.3. What is true is that the probability of having $h(S_1)=h(S_2)$, for generic $S_1$ and $S_2$ processed by a random permutation followed by minhash, is equal to $$\frac{x}{x+y},$$ which is also the Jaccard similarity of the two column sets. This is the meaning of the paragraph in the OP. To show this, the author imagines moving down from the top of the column sets and uses the definitions of the $X$, $Y$ and $Z$ classes of rows for both $S_1$ and $S_2$.
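For concreteness, here is a minimal sketch (my own, not from the book) of what "the minhash associated with a permutation" computes. The two columns below are chosen so that the result reproduces the values $h(S_1)=a$ and $h(S_2)=c$ mentioned above; they are not claimed to be the exact columns of figure 3.3.

```python
# Sketch of permutation-based minhash: h(S) is the label of the first row,
# in the permuted row order, whose entry in the column S is 1.

def minhash(column, perm):
    """column: dict row label -> 0/1; perm: row labels in permuted order."""
    for row in perm:
        if column[row] == 1:
            return row
    return None  # column of all zeros

# Illustrative columns (chosen to match the values quoted above, not figure 3.3 verbatim)
S1 = {'a': 1, 'b': 0, 'c': 0, 'd': 1, 'e': 0}
S2 = {'a': 0, 'b': 0, 'c': 1, 'd': 0, 'e': 0}
perm = ['b', 'e', 'a', 'd', 'c']   # the permutation abcde -> beadc

print(minhash(S1, perm), minhash(S2, perm))   # prints: a c   (so h(S1) != h(S2) here)
```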

  • Add on

Let us discuss the statement "the probability of having $h(S_1)=h(S_2)$, for generic $S_1$ and $S_2$ obtained by permutations and minhash, is equal to $\frac{x}{x+y}$". In order to compute $h(S_1)$ and $h(S_2)$ for any two column sets $S_1$ and $S_2$, we need to identify the position (= row number) of the first $1$, starting from the top of the column vectors, for both $S_1$ and $S_2$: this is the definition of the minhash function $h(\cdot)$. To simplify the discussion the author introduces three classes of rows in $S_1$ and $S_2$: $X$, $Y$ and $Z$. The classes containing 1's are $X$ (a $1$ appears in the given row in both $S_1$ and $S_2$) and $Y$ (a $1$ appears in the given row in exactly one of $S_1$ and $S_2$).
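As a small aid (my own sketch, not part of the book), the classification of rows into $X$, $Y$, $Z$ and the resulting ratio $\frac{x}{x+y}$ can be written out directly:

```python
# Classify the rows of two 0-1 columns into the classes X, Y, Z used above.

def row_classes(S1, S2):
    x = y = z = 0
    for a, b in zip(S1, S2):
        if a == 1 and b == 1:
            x += 1      # type X: 1 in both columns
        elif a == 1 or b == 1:
            y += 1      # type Y: 1 in exactly one column
        else:
            z += 1      # type Z: 0 in both columns
    return x, y, z

x, y, z = row_classes((1, 0, 1), (0, 1, 1))   # the example used in case 1 below
print(x, y, z)        # 1 2 0
print(x / (x + y))    # 0.333... = Jaccard similarity of the two sets
```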

Moving from top to bottom along both $S_1$ and $S_2$, we skip all rows of class $Z$: they contain no $1$. We stop the search when we meet a row of class $X$ or class $Y$. We distinguish two cases:

  1. Case 1: we meet a row of class $X$. As a $1$ appears in this row in both $S_1$ and $S_2$, we can compute $h(S_1)$ and $h(S_2)$, and they are equal, i.e. $h(S_1)=h(S_2)$, since the row under examination is the same. The question is: what is the probability of meeting an $X$ row before a $Y$ row while moving from top to bottom along $S_1$ and $S_2$? This probability is $\frac{x}{x+y}$: you should convince yourself of it by working through an explicit example. An idea would be to write down a simple case, like $S_1 = (1,0,1)$ and $S_2 = (0,1,1)$. For the permutation leading to the $S_1$ and $S_2$ given above, $x=1$ and $y=2$. Indeed, there is a $1/3\approx 33.3\%$ probability that the unique row of class $X$, i.e. $(1,1)$, appears before the two rows of class $Y$, $(0,1)$ and $(1,0)$. Remember that $S_1$ and $S_2$ are subsets whose representation as $0$-$1$ vectors is defined only up to a permutation of the rows. (See the simulation sketch after this list.)

  2. Case 2: we meet a row of class $Y$. As a $1$ appears in this row in only one of $S_1$ and $S_2$, that set gets this row as its minhash value, while the other set gets some row further down the permuted list; hence $h(S_1)\neq h(S_2)$, by construction of the class $Y$.
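To make case 1 concrete, here is a small Monte Carlo check (my own sketch, not from the book): over uniformly random row permutations, the fraction of permutations with $h(S_1)=h(S_2)$ should approach $\frac{x}{x+y}=\frac{1}{3}$ for the example $S_1=(1,0,1)$, $S_2=(0,1,1)$.

```python
import random

def minhash(column, perm):
    """Index of the first row, in permuted order, where the column has a 1."""
    for row in perm:
        if column[row] == 1:
            return row
    return None

S1 = (1, 0, 1)   # the columns from case 1: x = 1, y = 2, Jaccard = 1/3
S2 = (0, 1, 1)

trials = 100_000
hits = 0
for _ in range(trials):
    perm = list(range(len(S1)))
    random.shuffle(perm)                       # a uniformly random row permutation
    if minhash(S1, perm) == minhash(S2, perm):
        hits += 1

print(hits / trials)   # close to 1/3 = x / (x + y)
```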
