Solved – k-means with binary variables

I have converted all of my features to binary variables. now I have 21 features in my data set. I am trying to cluster them with k-means. I used Hamming distance in order to measure the distance between every instance and centroids at each steps.

But when I was trying to calculate the mean (in order to have a new centroid), I realized that taking a mean of binary variables does not make sense.

After doing some research I decided to use mode instead of mean. I used modes like this:

the rest of the algorithm is the same as k-means. but the problem is my error rate is too high.

my question is:
Am I doing it correctly? Do you have any suggestion for me to deal with these data? [if I only want the k-means output for this dataset]

update 1

I tried medians instead of modes (the same approach) and the result is still suffering from high error rate.

update 2

before clustering, I know about 70 percent of instances in my data set belongs to one group and about 25 percent belongs to another group. I think they affect the result of clustering. am I right?

Solved – k-means with binary variables

Best Answer

Related Question

Best Answer

Related Solutions

Solved – Using real data means as centroids for clustering

Solved – Probability for selecting centroids – K-means++

Related Question