Jaccard index between set and multiset

jaccard-similarity

Can I use the Jaccard index to calculate the similarity between a set and a multiset?

As far as I know, the Jaccard index is defined as the size of the intersection divided by the size of the union of the sample sets,

that is $J(A, B) = |A \cap B| \, / \, |A \cup B|$

Now if I have a set $s$, $s = \{\text{special}, \text{words}\}$
and a multiset $m$, $m = \{\text{term}, \text{special}, \text{words}, \text{special}\}$

How can I use the Jaccard index so that it takes repetition into account?

Best Answer

You can use the generalized Jaccard index and treat the set $s$ as a multiset in which every element has count 1:

If $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$ are two vectors with all real $x_i, y_i \geq 0$, then their Jaccard similarity coefficient is defined as $$J(\mathbf{x}, \mathbf{y}) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}.$$

Here you can read "vector" as "multiset", where $x_i$ is the count of element $i$ in the multiset $\mathbf{x}$ (and 0 if the element does not occur).
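For the example in the question, the counts over (term, special, words) are $(0, 1, 1)$ for $s$ and $(1, 2, 1)$ for $m$, so $J = (0 + 1 + 1) / (1 + 2 + 1) = 2/4 = 0.5$. A minimal Python sketch of the same calculation follows. The function name `generalized_jaccard` is just illustrative, and returning 1.0 for two empty inputs is an assumed convention, not part of the definition above. It relies on `collections.Counter`, whose `&` and `|` operators take the element-wise minimum and maximum of counts.

```python
from collections import Counter

def generalized_jaccard(a, b):
    """Generalized Jaccard index of two multisets given as iterables."""
    x, y = Counter(a), Counter(b)
    # Counter & Counter keeps min(count_x, count_y) for each element
    numerator = sum((x & y).values())
    # Counter | Counter keeps max(count_x, count_y) for each element
    denominator = sum((x | y).values())
    if denominator == 0:
        # Both inputs empty: assumed convention, treat them as identical
        return 1.0
    return numerator / denominator

s = {"special", "words"}                      # plain set, every count is 1
m = ["term", "special", "words", "special"]   # multiset as a list

print(generalized_jaccard(s, m))  # 0.5
```

When both inputs are ordinary sets, every count is 0 or 1 and this reduces to the usual $|A \cap B| / |A \cup B|$.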
