I have a dataset with about 800 observations, each with about 2000 boolean variables. I would like to cluster the observations. I am using scipy in Python.
For a (dis)similarity measure I've chosen the "Hamming" method, since it seems specifically developed for binary/boolean data. I've run into an issue with this method.
For example:
V1 V2 V3 V4 V5 V6
R1 = 1 1 1 1 1 1
R2 = 1 1 1 1 1 0
R3 = 0 0 0 1 1 1
R4 = 0 0 0 1 1 0
Comparing R1 with R2, gives a Hamming score of 1/6=0.167
Comparing R3 with R4, gives a Hamming score of 1/6=0.167
For my purposes however, the distance between R3 with R4 is more significant than the difference of R1 with R2. The 0 in my data stands for an absence of a variable (V).
The result that I am looking for is:
Comparing R1 with R2, gives a Hamming score of 1/6=0.167
Comparing R3 with R4, gives a Hamming score of 1/3=0.333
So I guess I am looking to divide the Hamming over the the count of variables (V) where there is at least one "1" for the pair that is compared. Should I pick a different measure or should I try to a create a variation of the Hamming method?
Best Answer
As you noticed in the comment, what you are describing seems to be the inverse of Jaccard similarity. Jaccard similarity is defined as
$$ J(A, B) = \frac{| A \cap B |}{| A \cup B |} $$
In your case, union for R3 and R4 is V4, V5, V6, while intersection is V4 and V5, so it's 2/3. It is a similarity metric, to make it a distance metric take $1 - J(A, B)$. It's $1 - 5/6 = 1/6$ and $1 - 2/3 = 1/3$ as you wanted.