[Math] Proving triangle inequality holds for Jaccard Distance

data-mining, metric-spaces, probability, proof-explanation

The following explanation is given in the Mining of Massive Datasets textbook to prove that the triangle inequality holds for the Jaccard distance:

"For the triangle inequality, recall from Section 3.3.3 that $\text{SIM}(x, y)$ is the probability a random minhash function maps $x$ and $y$ to the same value.
Thus, the Jaccard distance $d(x, y)$ is the probability that a random minhash
function does not send $x$ and $y$ to the same value. We can therefore
translate the condition $d(x, y) \leq d(x, z) + d(z, y)$ to the statement that, if $h$ is a random minhash function, then the probability that $h(x) \neq h(y)$
is no greater than the sum of the probability that $h(x) \neq h(z)$ and the
probability that $h(z) \neq h(y)$. However, this statement is true because
whenever $h(x) \neq h(y)$, at least one of $h(x)$ and $h(y)$ must be different
from $h(z)$. They could not both be $h(z)$, because then $h(x)$ and $h(y)$
would be the same."

reference link : http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
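The key probabilistic claim, $\text{SIM}(x, y) = \mathbb{P}[h(x) = h(y)]$ for a random minhash function $h$, can be sanity-checked numerically. Below is a minimal sketch that implements minhash via random permutations of a small universe; the example sets and universe size are my own illustration, not from the book:

```python
import random

def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b|."""
    return len(a & b) / len(a | b)

def minhash(s, perm):
    """Minhash of set s: the smallest permuted index among its elements."""
    return min(perm[e] for e in s)

random.seed(0)
universe = list(range(100))
x = set(range(0, 60))    # illustrative sets: |x & y| = 30, |x | y| = 90
y = set(range(30, 90))   # so SIM(x, y) = 1/3

trials = 10_000
matches = 0
for _ in range(trials):
    # Draw a uniformly random permutation of the universe.
    perm_vals = list(range(len(universe)))
    random.shuffle(perm_vals)
    perm = dict(zip(universe, perm_vals))
    if minhash(x, perm) == minhash(y, perm):
        matches += 1

estimate = matches / trials
print(f"estimated P[h(x) = h(y)] = {estimate:.3f}, exact SIM = {jaccard(x, y):.3f}")
```

The empirical match rate should land close to the exact Jaccard similarity of $1/3$, and accordingly $1 - \text{estimate}$ approximates the Jaccard distance $d(x, y)$.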

Could someone elaborate further on this, especially the last two sentences? That is where the gist of the proof is, but if someone could present it in a different, more step-by-step way, that would be great.

Thank you.

Best Answer

Let $h$ be a random minhash function (note: I'm not entirely sure what a minhash function is, but I don't believe this actually matters for the explanation).

We have that: $$d(x,y) = \mathbb{P}[h(x) \neq h(y)]$$ We want to show that: $$\mathbb{P}[h(x) \neq h(y)] \leq \mathbb{P}[h(x) \neq h(z)] + \mathbb{P}[h(z) \neq h(y)]$$ When the book says "whenever $h(x) \neq h(y)$", this is the same as conditioning each probability on the event $h(x) \neq h(y)$, so we get: $$\mathbb{P}[h(x)\neq h(y)\mid h(x)\neq h(y)] \leq \overbrace{\mathbb{P}[h(x)\neq h(z)\mid h(x)\neq h(y)]}^{A} +\overbrace{\mathbb{P}[h(z)\neq h(y)\mid h(x)\neq h(y)]}^{B}$$

The probability on the far left is clearly $1$. So if at least one of the probabilities on the right is $1$, we'll be done, and it's easy to see this is the case. Suppose $h(x) = h(z)$. Then the probability marked $A$ is $0$, but because $h(x) \neq h(y)$ and $h(x) = h(z)$, we must have $h(z) \neq h(y)$, so the event marked $B$ always occurs (and thus contributes $1$), and the inequality is satisfied.
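The underlying case analysis is really a pointwise statement about indicator functions: for every possible outcome, $[h(x) \neq h(y)] \leq [h(x) \neq h(z)] + [h(z) \neq h(y)]$, and taking expectations yields the inequality on probabilities. This can be checked exhaustively; three distinct hash values suffice, since only the pattern of equalities among $h(x)$, $h(y)$, $h(z)$ matters:

```python
from itertools import product

# Exhaustively verify the pointwise indicator inequality:
# [h(x) != h(y)] <= [h(x) != h(z)] + [h(z) != h(y)]
# over every assignment of h(x), h(y), h(z) to three possible values.
violations = 0
for hx, hy, hz in product(range(3), repeat=3):
    lhs = int(hx != hy)
    rhs = int(hx != hz) + int(hz != hy)
    if lhs > rhs:
        violations += 1

print(f"violations found: {violations}")  # expect 0
```

Since the inequality holds for every outcome, it holds in expectation, which is exactly the probability statement being proved.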

Now, imagine $h(z) = h(y)$. Then, reversing the above argument, $B$ will contribute $0$ and $A$ will contribute $1$, and the inequality will be satisfied.

If neither of these holds, then we have $h(z) \neq h(y)$ and $h(x) \neq h(z)$, so $A$ and $B$ both contribute $1$, and the inequality is satisfied.

So essentially, they just check all the possibilities. One possibility they don't check is when the probability on the far left is $0$ (that is, conditioning all events on $h(x) = h(y)$), but we don't need to, as all probabilities are greater than or equal to $0$, so whatever appears on the right will satisfy the inequality.
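As a final sanity check, the triangle inequality can also be tested directly on random sets using the exact Jaccard distance. This is brute-force verification on samples, not a proof; the universe size and membership probability below are arbitrary choices:

```python
import random

def jaccard_distance(a, b):
    """Jaccard distance d(a, b) = 1 - |a & b| / |a | b|."""
    return 1.0 - len(a & b) / len(a | b)

random.seed(1)
universe = range(50)
failures = 0
for _ in range(1000):
    # Each element joins each set independently with probability 0.4.
    x = {e for e in universe if random.random() < 0.4}
    y = {e for e in universe if random.random() < 0.4}
    z = {e for e in universe if random.random() < 0.4}
    if x and y and z:  # skip the degenerate empty-set case
        d_xy = jaccard_distance(x, y)
        d_xz = jaccard_distance(x, z)
        d_zy = jaccard_distance(z, y)
        if d_xy > d_xz + d_zy + 1e-12:  # tolerance for float rounding
            failures += 1

print(f"triangle-inequality failures: {failures}")  # expect 0
```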