Solved – k-means with binary variables

Tags: clustering, distance, r

Is it OK to use k-means with binary variables, i.e., with Euclidean distance? I guess the binary variables will be the ones that carry the most weight in determining the result.

Look at the following example:

data <- data.frame(a = c(1, 0, 1, 1), b = c(0.1, 0.2, 0.6, 0.8))
plot(data)
kmeans(data, 2)
## Clustering vector: [1] 1 2 1 1

So the result is determined by the binary variable.

Is there a way to treat binary variables differently? Should I use Manhattan distance for all variables?

Best Answer

K-means uses the mean.

Relevant properties of the mean:

  • minimizes the L2 errors (sum of squares, squared Euclidean distance)
  • is continuous
  • assumes linear data (see below for an example)
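The first property can be checked numerically. A small sketch in R (the data vector is just an illustration) that compares the sum of squared errors at the mean against a grid of candidate centers:

```r
# Numerical check: the mean minimizes the sum of squared errors (L2).
x <- c(0.1, 0.2, 0.6, 0.8)
sse <- function(center) sum((x - center)^2)

# Evaluate the SSE on a fine grid of candidate centers.
candidates <- seq(0, 1, by = 0.001)
best <- candidates[which.min(sapply(candidates, sse))]

# The grid minimizer coincides with the mean (up to grid resolution).
abs(best - mean(x)) < 0.001
```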

Technically, you can run k-means on binary data, but as you have observed there is a tendency for the algorithm to converge to local minima that are determined by a single bit or a few bits.

You can easily provoke the opposite effect, too: scale your continuous attribute to 10000000 and the algorithm will effectively ignore the binary attributes.
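For illustration, a sketch of this opposite effect on the same toy data, with the continuous attribute blown up by a large factor:

```r
set.seed(42)  # kmeans starts from random centers
data <- data.frame(a = c(1, 0, 1, 1), b = c(0.1, 0.2, 0.6, 0.8))

# Inflate the continuous attribute; its squared differences now dwarf
# the 0/1 differences of the binary attribute.
data$b <- data$b * 1e7

km <- kmeans(data, 2)
km$cluster
# The split now follows b (small values vs. large values),
# regardless of the binary attribute a: rows 1-2 vs. rows 3-4.
```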

K-means assumes that all attributes are equally important; more precisely, that a difference of x has the same importance independent of the attribute in which it occurs and of the absolute values involved. So a difference in a binary value is as important as the difference between \$0 and \$1 in the price of a burger, or between \$9999 and \$10000 when buying a house... If this invariance does not hold for your data, do not use k-means (or preprocess your data until it does seem to hold).
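One common preprocessing step (an illustration, not a universal fix) is to standardize all attributes before clustering, so that a one-standard-deviation difference carries comparable weight in every column:

```r
data <- data.frame(a = c(1, 0, 1, 1), b = c(0.1, 0.2, 0.6, 0.8))

# scale() centers each column and divides by its standard deviation,
# so no single attribute dominates the squared Euclidean distance.
scaled <- scale(data)

set.seed(1)
kmeans(scaled, 2)$cluster
```

Whether standardization is the right rescaling depends on what a "unit of difference" should mean in your application; for genuinely mixed data types, distance measures designed for mixed attributes are another option.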