Solved – Any distance measures that are more useful for binary data clustering

binary dataclusteringdistancehierarchical clusteringmetric

I was taking a look at Clustering a binary matrix but it didn't seem to answer my question.

I used a basic euclidean distance measure which definitely works but I am exploring alternative distance measures. All the distance measures I know of can be applied to binary data, but are not specific to binary data.

This data I'm dealing with is binary and I was wondering if there are any measures of distance for binary vectors/matrices?

I use Python 3 and here is a script I made to produce a dendrogram from the binary clusters. Essentially, I would be looking for alternatives to pairwise_distances(DF_data, metric="euclidean"). I could even manually code them in myself but mostly looking for distance measures known to work well with this type of data.

# Init
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_style("white")

# Clustering
from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list
from scipy.spatial import distance
from fastcluster import linkage
from sklearn.metrics.pairwise import pairwise_distances

%matplotlib inline

A_data = np.array([[0,0,1,1,0,0],
                  [0,1,1,1,0,0],
                  [0,0,0,0,0,1],
                  [0,0,0,0,1,1],
                  [1,1,1,1,0,0]])

DF_data = pd.DataFrame(A_data, 
                       index = ["sample_%d" % i for i in range(A_data.shape[0])], 
                       columns = ["attr_%d" % j for j in range(A_data.shape[1])])

# >>> DF_data
#           attr_0  attr_1  attr_2  attr_3  attr_4  attr_5
# sample_0       0       0       1       1       0       0
# sample_1       0       1       1       1       0       0
# sample_2       0       0       0       0       0       1
# sample_3       0       0       0       0       1       1
# sample_4       1       1       1       1       0       0

# Distance Matrix
cA_euclid = distance.squareform(pairwise_distances(DF_data, metric="euclidean"))
# array([ 1.        ,  1.73205081,  2.        ,  1.41421356,  2.        ,
#         2.23606798,  1.        ,  1.        ,  2.23606798,  2.44948974])

# Linkage Matrix
Z = linkage(cA_euclid, method="average")

# Dendrogram
dendrogram(Z, labels=DF_data.index)

enter image description here

Best Answer

You could consider the Hamming distance between the two vectors, which is just the number of coordinates whose values differ. If your vectors contain only zeros and ones then this is equivalent to the $L_1$ norm of the difference.