I have converted all of my features to binary variables. now I have 21 features in my data set. I am trying to cluster them with k-means. I used Hamming distance in order to measure the distance between every instance and centroids at each steps.
But when I was trying to calculate the mean (in order to have a new centroid), I realized that taking a mean of binary variables does not make sense.
After doing some research I decided to use mode instead of mean. I used modes like this:
the rest of the algorithm is the same as k-means. but the problem is my error rate is too high.
my question is:
Am I doing it correctly? Do you have any suggestion for me to deal with these data? [if I only want the k-means output for this dataset]
update 1
I tried medians instead of modes (the same approach) and the result is still suffering from high error rate.
update 2
before clustering, I know about 70 percent of instances in my data set belongs to one group and about 25 percent belongs to another group. I think they affect the result of clustering. am I right?
Best Answer
I'd rather consider frequent itemset mining.
I think the problems you see arise from two assumptions:
Also, how do you evaluate? What is the "error rate" you are using?