Is it OK to use kmeans with binary variables, given that it relies on Euclidean distance? I guess the binary variables will be the ones that get the most power to determine the result.
Look at the following example:
data <- data.frame(a = c(1, 0, 1, 1), b = c(0.1, 0.2, 0.6, 0.8))
plot(data)
kmeans(data,2)
## Clustering vector: [1] 1 2 1 1
So the result is determined by the binary variable.
Is there a way to treat binary variables differently? Should I use Manhattan distance for all variables?
Best Answer
K-means uses the mean.
Relevant properties of the mean:
Technically, you can run k-means on binary data but, as you have observed, the algorithm tends to converge to local minima that are determined by a single bit or a few bits.
You can easily provoke the opposite effect, too. Scale your continuous attribute up to the order of 10000000 and the algorithm will ignore the binary attributes.
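The opposite effect can be reproduced with the four points from the question. This is only a sketch: the factor 1e7, the seed and the `nstart` value are choices made for this illustration, not part of the original example.

```r
# Same four points as in the question.
data <- data.frame(a = c(1, 0, 1, 1), b = c(0.1, 0.2, 0.6, 0.8))

# Blow up the continuous attribute (factor chosen for illustration only).
scaled <- transform(data, b = b * 1e7)

set.seed(1)  # k-means starts from random centers
fit_scaled <- kmeans(scaled, centers = 2, nstart = 10)

# The split now follows b alone: rows 1-2 against rows 3-4,
# regardless of the value of the binary attribute a.
fit_scaled$cluster
```

The cluster labels themselves (1 or 2) are arbitrary; what matters is which rows end up together.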
K-means assumes that all attributes are equally important; more precisely, that a difference of x has the same importance regardless of the attribute in which it occurs and of the absolute values between which it occurs. So flipping a binary value is as important as the difference between \$0 and \$1 in the price of a burger, or between \$9999 and \$10000 when buying a house... If this invariance does not hold for your data, do not use k-means (or preprocess your data until it does seem to hold).
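A small sketch of both points, again on the data from the question. The first part checks that Euclidean distance sees a one-unit difference the same way wherever it occurs. The second part shows one possible preprocessing step, down-weighting the binary attribute; the weight `w = 0.2` is a hypothetical value picked for this illustration, not a recommendation, and the right weight depends on what a bit flip should be worth in your data.

```r
# Euclidean distance only sees the size of a difference, not where it occurs.
d_bit   <- as.numeric(dist(c(0, 1)))         # flipping a binary value
d_house <- as.numeric(dist(c(9999, 10000)))  # $9999 vs. $10000
c(d_bit, d_house)                            # both distances are equal

# Same four points as in the question.
data <- data.frame(a = c(1, 0, 1, 1), b = c(0.1, 0.2, 0.6, 0.8))

# Hypothetical weight: treat a bit flip like a difference of 0.2 in b.
w <- 0.2
weighted <- transform(data, a = a * w)

set.seed(1)  # k-means starts from random centers
fit_weighted <- kmeans(weighted, centers = 2, nstart = 10)

# With this weight the binary attribute no longer decides the split:
# rows 1-2 (small b) are separated from rows 3-4 (large b).
fit_weighted$cluster
```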