Can I use the Jaccard index to calculate similarity between a set and a multiset?
As far as I know, the Jaccard index is defined as the size of the intersection divided by the size of the union of the sample sets,
that is $J(A, B) = |A \cap B| \, / \, |A \cup B|$
Now if I have a set $s$, $s = \{\text{special}, \text{words}\}$
and a multiset $m$, $m = \{\text{term}, \text{special}, \text{words}, \text{special}\}$
How can I use the Jaccard index so that it takes repetition into account?
Best Answer
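You can use the generalized Jaccard index and treat the set $s$ as a multiset in which every element has count 1:
$$J(\mathbf x, \mathbf y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$$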
Here you can read "vector" as "multiset", and $x_i$ is the number of times element $i$ occurs in the multiset $\mathbf x$.
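As a rough illustration, here is a minimal Python sketch applied to the $s$ and $m$ from the question. It assumes the usual definition of the generalized Jaccard index: the sum of the element-wise minimum counts divided by the sum of the element-wise maximum counts. The function name `generalized_jaccard` and the convention of returning 1.0 for two empty multisets are my own choices, not part of any standard.

```python
from collections import Counter


def generalized_jaccard(a, b):
    """Generalized Jaccard similarity of two multisets given as iterables."""
    x, y = Counter(a), Counter(b)
    # Counter & Counter keeps the minimum count of each element,
    # Counter | Counter keeps the maximum count of each element.
    numerator = sum((x & y).values())
    denominator = sum((x | y).values())
    # Assumption: two empty multisets are treated as identical.
    return numerator / denominator if denominator else 1.0


s = {"special", "words"}                     # plain set: every count is 1
m = ["term", "special", "words", "special"]  # multiset: "special" occurs twice

print(generalized_jaccard(s, m))  # 2 / 4 = 0.5
```

For comparison, the ordinary Jaccard index on $s$ and the distinct elements of $m$ would be $2/3$, so the repeated "special" lowers the similarity to $0.5$.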