Solved – boxcox transform variables before k-means clustering

clustering, k-means

I was experimenting in my class with k-means clustering on a large data set of mostly numerical variables (How Americans spend their time). The few categorical variables (such as education and marital status) were converted to dummy variables. Because the numerical variables (such as TV time, time spent with children, and number of children) had very different value ranges, it was necessary to scale and transform them.

A question has arisen whether, besides scaling, we should also try to make skewed data symmetric by using Box-Cox transformations. In R, the caret package has a nice and easy pre-processing facility that can carry out all transformations and scaling in one command.

My question is: since the Box-Cox transformation is a power transformation, will it not change the shape of the data distribution, so that we may get a different set of clusters than originally existed in the dataset? While linear scaling of the data may be OK, I am a little skeptical about Box-Cox transformations.

I will be grateful for an answer.

Best Answer

k-means is very sensitive to data distribution.

In particular, the use of k-means on binary (e.g. dummy) variables is questionable, because the mean no longer makes much sense. You cannot easily map cluster centers back to attribute values!
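To make the point about dummy variables concrete, here is a minimal Python sketch. The one-hot rows and the `centroid` helper are hypothetical and invented for illustration; they are not taken from the time-use data set. It only shows that the column-wise mean used by k-means yields fractions that match no actual category.

```python
# Hypothetical one-hot rows for a "marital status" attribute.
# Columns are (single, married, divorced); the data is made up.
cluster_members = [
    (1, 0, 0),
    (0, 1, 0),
    (0, 1, 0),
    (0, 0, 1),
]


def centroid(rows):
    """Column-wise mean, which is what k-means uses as a cluster center."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


center = centroid(cluster_members)
print(center)  # (0.25, 0.5, 0.25)

# The center is a vector of proportions, not a valid one-hot row,
# so it cannot be read back as a single marital status.
print(center in cluster_members)  # False
```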

As for using Box-Cox with k-means, that is actually a good thing. k-means does not handle skewed distributions well. So for an attribute such as "income" (which is notoriously skewed), such a transformation may improve results a lot.
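The effect on a skewed attribute can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the income values are invented, the Box-Cox parameter is fixed at 0 (the log case) rather than estimated from the data as a pre-processing tool would normally do, and `kmeans_1d` is a toy two-cluster Lloyd iteration seeded at the minimum and maximum, not the caret or R implementation.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform for strictly positive values; lam == 0 is the log."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def kmeans_1d(values, iterations=20):
    """Toy two-cluster Lloyd iteration in 1-D, seeded at the min and max."""
    centers = [min(values), max(values)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to the nearer center.
        labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Update step: each center moves to the mean of its members.
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels


# Hypothetical right-skewed incomes (in thousands); invented data.
incomes = [20, 25, 30, 35, 40, 50, 60, 80, 150, 400]
transformed = box_cox(incomes, lam=0)  # lam=0 chosen by hand, not estimated

print("skewness before:", skewness(incomes))
print("skewness after: ", skewness(transformed))

raw_labels = kmeans_1d(incomes)
log_labels = kmeans_1d(transformed)
print("clusters on raw data:        ", raw_labels)
print("clusters on transformed data:", log_labels)
```

On this toy data the raw run leaves only the largest income in its own cluster, while the transformed run groups the two largest incomes together. So the partition does change, as the question anticipates; the change comes from the extreme value no longer dominating the squared distances.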
