K-Means Clustering – Usefulness on High Dimensional Data

clustering, k-means

I wonder how useful k-means clustering is in high-dimensional spaces, and why it might be better (or worse) than other clustering methods in that setting.

Best Answer

Is k-means meaningful at all?

See for example my answer here: https://stats.stackexchange.com/a/35760/7828

k-means minimizes within-cluster variance. Is the unweighted sum of variances over all attributes meaningful on your data set? Probably not. How, then, can k-means be meaningful? In high-dimensional data, distances often stop being discriminative. But the variance that k-means minimizes is just the squared Euclidean distance to the cluster mean; so is it meaningful to optimize something you know doesn't work well in high-dimensional data?
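To make the "variance = squared Euclidean distance" point concrete, here is a minimal sketch using only the Python standard library. The cluster is made-up data with three attributes on deliberately different scales (an assumption for illustration, not something from the question). It checks that the sum of squared distances to the mean equals the cluster size times the unweighted sum of per-attribute variances, and shows how much one large-scale attribute can dominate that sum.

```python
import random

def mean_vector(points):
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def sum_squared_distances(points):
    """Sum of squared Euclidean distances from each point to the cluster mean."""
    center = mean_vector(points)
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

def per_attribute_variances(points):
    """Population variance of each attribute, computed independently."""
    center = mean_vector(points)
    n = len(points)
    return [sum((p[d] - center[d]) ** 2 for p in points) / n
            for d in range(len(center))]

rng = random.Random(0)
# Hypothetical cluster: 50 objects, attribute scales 1, 100 and 0.01.
cluster = [[rng.gauss(0, 1), rng.gauss(0, 100), rng.gauss(0, 0.01)]
           for _ in range(50)]

sse = sum_squared_distances(cluster)
variances = per_attribute_variances(cluster)
var_form = len(cluster) * sum(variances)      # same quantity, written as variances
dominant_share = variances[1] / sum(variances)  # share of the scale-100 attribute

print(sse, var_form, dominant_share)
```

The two totals agree up to floating-point rounding, and almost all of the objective comes from the second attribute. Whether that weighting makes sense depends entirely on what the attributes mean, which is the question to ask before running k-means.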

For the particular problems of high-dimensional data, I recommend the following study:

Zimek, A., Schubert, E. and Kriegel, H.-P. (2012), A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5: 363–387. doi: 10.1002/sam.11161

Its main focus is outlier detection, but the observations on the challenges of high-dimensional data apply to a much broader context. The authors run some simple experiments showing how high-dimensional data can be a problem. What I like about this study is that they also show that high-dimensional data can be easy, too; it's not black and white, and you need to study your data carefully.
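The following is not one of the experiments from that study, only a rough sketch in the same spirit, with synthetic data and parameters chosen here for illustration. The first function measures how much the nearest and farthest neighbor distances differ for i.i.d. uniform data (the "hard" case). The second uses two Gaussian clusters that differ in every attribute and measures how often a point's nearest neighbor belongs to its own cluster (a case where more dimensions help).

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relative_contrast(dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point to
    n_points others, all drawn i.i.d. uniform from the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dims)]
    dists = [euclidean(query, [rng.random() for _ in range(dims)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

def nearest_neighbor_purity(dims, n_per_cluster=20, seed=0):
    """Fraction of points whose nearest neighbor is in the same cluster, for
    two unit-variance Gaussian clusters whose means differ by 1 in every
    dimension (so every attribute carries a little signal)."""
    rng = random.Random(seed)
    points = [([rng.gauss(label, 1.0) for _ in range(dims)], label)
              for label in (0, 1) for _ in range(n_per_cluster)]
    hits = 0
    for i, (p, label) in enumerate(points):
        _, nn_label = min((euclidean(p, q), other)
                          for j, (q, other) in enumerate(points) if j != i)
        hits += (nn_label == label)
    return hits / len(points)

contrast_low, contrast_high = relative_contrast(2), relative_contrast(1000)
purity_low, purity_high = nearest_neighbor_purity(2), nearest_neighbor_purity(500)
print(contrast_low, contrast_high, purity_low, purity_high)
```

With pure noise, the contrast between near and far neighbors shrinks as dimensions are added. When every added attribute is informative, the clusters become easier to separate instead. Real data usually sits somewhere in between.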

Useful is a different question. Often people use k-means not to actually discover clusters, but to find representative objects. It's a clever way of semi-randomly sampling k objects that aren't too similar to one another to be useful.

If you only need a clever way of sampling, k-means may be very useful.
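As a sketch of this sampling use, the code below runs a plain Lloyd-style k-means and then maps each cluster mean back to its closest member, since the means themselves are usually not actual objects. That mapping step, the function name `kmeans_representatives`, and the toy data are assumptions made here for illustration; in practice you would more likely call a library implementation.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_representatives(points, k, iterations=20, seed=0):
    """Run simple Lloyd-style k-means, then return the index of the real
    data point closest to the mean of each non-empty cluster."""
    rng = random.Random(seed)
    dims = len(points[0])
    centers = [list(p) for p in rng.sample(points, k)]  # random initial centers
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(i)
        # Update step: move each center to the mean of its members.
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(points[i][d] for i in members) / len(members)
                              for d in range(dims)]
    # Pick one actual object per cluster: the member nearest to the mean.
    return [min(members, key=lambda i: squared_distance(points[i], centers[c]))
            for c, members in enumerate(clusters) if members]

rng = random.Random(1)
# Hypothetical 2-D data: three loose blobs of 30 points each.
data = [[rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)]
        for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(30)]

rep_indices = kmeans_representatives(data, k=3)
representatives = [data[i] for i in rep_indices]
print(rep_indices)
```

Because each representative is drawn from a different cluster, the selected objects are distinct and tend to be spread across the data rather than bunched together, which is what plain random sampling does not guarantee.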