Solved – k-means with binary variables

clustering, data mining, k-means, machine learning

I have converted all of my features to binary variables, and I now have 21 features in my data set. I am trying to cluster the data with k-means. I use Hamming distance to measure the distance between every instance and the centroids at each step.

But when I was trying to calculate the mean (in order to have a new centroid), I realized that taking a mean of binary variables does not make sense.

After doing some research, I decided to use the mode instead of the mean. I computed the modes like this:

[image: mode computation, not reproduced]

The rest of the algorithm is the same as k-means, but the problem is that my error rate is too high.
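For readers following along, the procedure described above (Hamming distance for assignment, per-feature mode for the centroid update) is close to what is often called k-modes. The following is only a minimal sketch of that idea, not the asker's actual code: the function names `hamming`, `column_mode`, and `k_modes`, the tie-breaking rule (ties go to 1), the random initialisation, and the toy data are all assumptions made for illustration.

```python
import random

def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def column_mode(rows):
    """Per-feature mode of a non-empty list of binary vectors (assumption: ties go to 1)."""
    n = len(rows)
    return tuple(1 if 2 * sum(col) >= n else 0 for col in zip(*rows))

def k_modes(data, k, n_iter=20, seed=0):
    """Lloyd-style loop: assign by Hamming distance, update centroids by mode."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # assumption: initial centroids are random rows
    labels = None
    for _ in range(n_iter):
        # Assignment step: each row goes to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: hamming(row, centroids[j]))
            for row in data
        ]
        if new_labels == labels:  # no assignment changed, so stop
            break
        labels = new_labels
        # Update step: each centroid becomes the per-feature mode of its members.
        for j in range(k):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = column_mode(members)
    return labels, centroids

# Hypothetical toy data with 6 binary features.
toy = [
    (1, 1, 1, 0, 0, 0), (1, 1, 0, 0, 0, 0), (1, 1, 1, 1, 0, 0),
    (0, 0, 0, 1, 1, 1), (0, 0, 1, 1, 1, 1), (0, 0, 0, 0, 1, 1),
]
labels, centroids = k_modes(toy, k=2)
```

Like k-means, this loop depends on the initial centroids, so in practice it would probably be run from several seeds and the best result kept.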

My question is:
Am I doing this correctly? Do you have any suggestions for dealing with this data, given that I only want the k-means output for this dataset?

Update 1

I tried medians instead of modes (same approach), and the result still suffers from a high error rate.
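One possible reason the median changes nothing: for a binary column without an exact tie between 0s and 1s, the median and the mode are the same value, so the centroid update is identical. A small check of this, using hypothetical example columns and a helper name (`median_and_mode`) chosen only for illustration:

```python
from statistics import median, mode

def median_and_mode(column):
    """Return (median, mode) of a binary column; assumes no exact 0/1 tie."""
    return median(column), mode(column)

# Hypothetical binary feature columns, none with an exact tie.
columns = [
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 1, 1, 1, 0, 1],
]
results = [median_and_mode(col) for col in columns]
```

With an exact tie (as many 0s as 1s) the median is 0.5 and has to be rounded one way or the other, which is the only case where the two updates could differ.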

Update 2

Before clustering, I already know that about 70 percent of the instances in my data set belong to one group and about 25 percent belong to another. I think this imbalance affects the clustering result. Am I right?

Best Answer

I'd rather consider frequent itemset mining.

I think the problems you see arise from two assumptions built into this approach:

  1. each object belongs to exactly one cluster
  2. each attribute has the same importance

Also, how do you evaluate? What is the "error rate" you are using?
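To give a rough idea of what frequent itemset mining looks at, here is a brute-force sketch that counts how often sets of features are jointly 1. It is an illustration only: the function name `frequent_itemsets`, the `min_support` and `max_size` parameters, and the toy rows are assumptions, and real implementations use algorithms such as Apriori or FP-growth instead of enumerating every combination.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(rows, min_support, max_size=2):
    """Map each feature-index set to its support (the fraction of rows in
    which all of its features are 1), keeping those >= min_support."""
    n = len(rows)
    counts = Counter()
    for row in rows:
        active = [i for i, value in enumerate(row) if value == 1]
        for size in range(1, max_size + 1):
            for itemset in combinations(active, size):
                counts[itemset] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# Hypothetical toy data: 4 instances, 4 binary features.
rows = [
    (1, 1, 0, 0),
    (1, 1, 1, 0),
    (1, 1, 0, 1),
    (0, 0, 1, 1),
]
result = frequent_itemsets(rows, min_support=0.5)
```

Unlike a k-means partition, a single row can support several itemsets at once, and features that rarely co-occur with others simply do not appear in the output, which relaxes both assumptions listed above.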