Solved – Any distance measures that are more useful for binary data clustering

I was taking a look at Clustering a binary matrix but it didn't seem to answer my question.

I used a basic euclidean distance measure which definitely works but I am exploring alternative distance measures. All the distance measures I know of can be applied to binary data, but are not specific to binary data.

This data I'm dealing with is binary and I was wondering if there are any measures of distance for binary vectors/matrices?

I use Python 3 and here is a script I made to produce a dendrogram from the binary clusters. Essentially, I would be looking for alternatives to pairwise_distances(DF_data, metric="euclidean"). I could even manually code them in myself but mostly looking for distance measures known to work well with this type of data.

# Init
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_style("white")

# Clustering
from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list
from scipy.spatial import distance
from fastcluster import linkage
from sklearn.metrics.pairwise import pairwise_distances

%matplotlib inline

A_data = np.array([[0,0,1,1,0,0],
                  [0,1,1,1,0,0],
                  [0,0,0,0,0,1],
                  [0,0,0,0,1,1],
                  [1,1,1,1,0,0]])

DF_data = pd.DataFrame(A_data, 
                       index = ["sample_%d" % i for i in range(A_data.shape[0])], 
                       columns = ["attr_%d" % j for j in range(A_data.shape[1])])

# >>> DF_data
#           attr_0  attr_1  attr_2  attr_3  attr_4  attr_5
# sample_0       0       0       1       1       0       0
# sample_1       0       1       1       1       0       0
# sample_2       0       0       0       0       0       1
# sample_3       0       0       0       0       1       1
# sample_4       1       1       1       1       0       0

# Distance Matrix
cA_euclid = distance.squareform(pairwise_distances(DF_data, metric="euclidean"))
# array([ 1.        ,  1.73205081,  2.        ,  1.41421356,  2.        ,
#         2.23606798,  1.        ,  1.        ,  2.23606798,  2.44948974])

# Linkage Matrix
Z = linkage(cA_euclid, method="average")

# Dendrogram
dendrogram(Z, labels=DF_data.index)

Solved – Any distance measures that are more useful for binary data clustering

Best Answer

Related Question

Best Answer

Related Solutions

Solved – Clustering with asymmetrical distance measures

Solved – (hierarchical) cluster analysis with non-standard distance

Related Question