Hamming Distance – How to Calculate While Ignoring Matching Zeros for Binary Data

binary dataclusteringhamming distancehierarchical clustering

I have a dataset with about 800 observations, each with about 2000 boolean variables. I would like to cluster the observations. I am using scipy in Python.

For a (dis)similarity measure I've chosen the "Hamming" method, since it seems specifically developed for binary/boolean data. I've run into an issue with this method.

For example:

     V1 V2 V3 V4 V5 V6
R1 = 1  1  1  1  1  1
R2 = 1  1  1  1  1  0
R3 = 0  0  0  1  1  1
R4 = 0  0  0  1  1  0

Comparing R1 with R2, gives a Hamming score of 1/6=0.167
Comparing R3 with R4, gives a Hamming score of 1/6=0.167

For my purposes however, the distance between R3 with R4 is more significant than the difference of R1 with R2. The 0 in my data stands for an absence of a variable (V).
The result that I am looking for is:

Comparing R1 with R2, gives a Hamming score of 1/6=0.167
Comparing R3 with R4, gives a Hamming score of 1/3=0.333

So I guess I am looking to divide the Hamming over the the count of variables (V) where there is at least one "1" for the pair that is compared. Should I pick a different measure or should I try to a create a variation of the Hamming method?

Best Answer

As you noticed in the comment, what you are describing seems to be the inverse of Jaccard similarity. Jaccard similarity is defined as

$$ J(A, B) = \frac{| A \cap B |}{| A \cup B |} $$

In your case, union for R3 and R4 is V4, V5, V6, while intersection is V4 and V5, so it's 2/3. It is a similarity metric, to make it a distance metric take $1 - J(A, B)$. It's $1 - 5/6 = 1/6$ and $1 - 2/3 = 1/3$ as you wanted.

Related Question