[Math] Proof of calculating MinHash

hash functionprobabilityproof-writing

I'm reading about MinHash technique to estimate the similarity between 2 sets: Given set A and B, h is the hash function and $h_\min(S)$ is the minimum hash of set S, i.e. $h_\min(S) = \min(h(s))$ for s in S. We have the equation:
$$
p(h_\min(A) = h_\min(B)) = \frac{|A \cap B|}{|A \cup B|}
$$
Which means the probability that minimum hash of A equals to minimum hash of B is the Jaccard similarity of $A$ and $B$.

I am trying to prove above equation and come up with a proof, which is:
for $a \in A$ and $b \in B$ such that $h(a) = h_\min(A)$ and $h(b) = h_\min(B)$. So, if $h_\min(A) = h_\min(B)$ then $h(a) = h(b)$. Assume that hash function h can hash keys to distinct hash value, so $h(a) = h(b)$ if and only if $a = b$, which is $\frac{|A \cap B|}{|A \cup B|}$. However, my proof is not complete since hash function can return the same value for different keys. So, I'm asking for your help to find a proof which can be applied regardless the hash function. Thanks.

Best Answer

It seems like it is an underlying assumption in this formula that different elements map to distinct hashes. Otherwise, just have your hash function map everyone to 0. The probability you are looking at would be 1 regardless of any sets.

Best Answer

Related Solutions

[Math] How Jaccard similarity can be approximated with minhash similarity

[Math] Jaccard similarity between columns with only cyclic permutations

Related Question