My data is mostly continuous but has one binary variable. I tried the pam algorithm in R with the Gower index, but the number of clusters that gives the best silhouette width is 2, which lets the binary variable completely dominate the result.
- Is PAM the wrong approach?
- Is it OK to choose a higher k just because it will give more meaningful results?
Best Answer
If the binary variable is not very useful, try putting less weight on it.
There is nothing wrong with having a domain expert manually assign weights to the attributes so that the algorithm can surface new information. That the binary attribute splits the data in two is a correct result, but it is one you already know. If you want to find something new, either remove the attribute (weight 0) or at least reduce its weight.
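As a rough illustration of down-weighting, here is a minimal sketch using the `weights` argument of `cluster::daisy`, which takes one weight per column when `metric = "gower"`. The toy data frame, the column names (`x1`, `x2`, `flag`), and the weight of 0.1 are all made up for the example; a sensible weight for your data is a judgment call.

```r
library(cluster)

set.seed(1)

# Hypothetical toy data: three groups in the continuous columns,
# plus a binary flag that is unrelated to those groups.
n   <- 150
grp <- rep(1:3, each = n / 3)
df  <- data.frame(
  x1   = rnorm(n, mean = c(0, 4, 8)[grp]),
  x2   = rnorm(n, mean = c(0, 4, 0)[grp]),
  flag = factor(rbinom(n, 1, 0.5))
)

# Average silhouette width of a PAM solution for each candidate k.
avg_sil <- function(d, ks = 2:6) {
  sapply(ks, function(k) pam(d, k = k, diss = TRUE)$silinfo$avg.width)
}

# Gower dissimilarity with equal weights (the default) ...
d_equal <- daisy(df, metric = "gower")
# ... and with the binary column down-weighted (0.1 is an arbitrary choice).
d_down  <- daisy(df, metric = "gower", weights = c(1, 1, 0.1))

sil <- rbind(equal = avg_sil(d_equal), down = avg_sil(d_down))
colnames(sil) <- paste0("k=", 2:6)
round(sil, 3)
```

Comparing the two rows shows how the preferred k shifts once the binary column contributes less to the dissimilarity. Dropping the column from `df` before calling `daisy` is the weight-0 case.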