Solved – What is the optimum number of features for 10 million observations in k-means clustering?

dimensionality reduction, high-dimensional, k-means, machine learning

I have a dataset with 10 million observations and 100 million features, and I have to perform k-means clustering on it. The approximate value of k is 30,000.

Is it advisable to perform clustering with such a huge number of features? What problems might I face? (Currently, I am hitting OutOfMemory errors in Spark MLlib k-means.)

Wouldn't it be better to perform PCA to reduce the number of features, or to re-engineer the feature vectors so that they contain fewer features? What would be the ideal number of features? Is there any documentation on high dimensionality and k-means?
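For concreteness, this is the kind of PCA-then-cluster pipeline I have in mind with pyspark.ml; the column names, the DataFrame `df`, and the choice of 200 components are placeholders, not something I have working:

```python
from pyspark.ml.feature import PCA
from pyspark.ml.clustering import KMeans

# df is assumed to be a DataFrame with a vector column "features".
# Project onto the top 200 principal components (placeholder value).
pca = PCA(k=200, inputCol="features", outputCol="pcaFeatures")
pca_model = pca.fit(df)
reduced = pca_model.transform(df).select("pcaFeatures")

# Cluster in the reduced space instead of the raw 100M-dimensional one.
kmeans = KMeans(k=30000, featuresCol="pcaFeatures", seed=1)
km_model = kmeans.fit(reduced)
```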

Best Answer

Any algorithm that relies on a distance metric in a high-dimensional space will suffer from the curse of dimensionality. In effect, all your observations will appear "far" from one another, with relatively little variation in the distance measurements, making the clustering very weak. You'd be much better off selecting informative features and using only those to construct the distance metric for k-means. The ideal number of features is very much problem-dependent, so there's no guideline prescribing a set number.
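As a minimal sketch of that feature-selection route, staying in Spark since that's your current stack: assuming Spark 3.1+ (where `VarianceThresholdSelector` is available) and a DataFrame `df` with a vector column named "features", you could drop near-constant features before clustering. The threshold of 0.1 below is an arbitrary placeholder, not a recommendation:

```python
from pyspark.ml.feature import VarianceThresholdSelector
from pyspark.ml.clustering import KMeans

# Keep only features whose sample variance exceeds the threshold;
# near-constant features contribute almost nothing to distances.
selector = VarianceThresholdSelector(
    varianceThreshold=0.1,
    featuresCol="features",
    outputCol="selectedFeatures",
)
selected = selector.fit(df).transform(df)

# Run k-means on the reduced feature set.
kmeans = KMeans(k=30000, featuresCol="selectedFeatures", seed=1)
model = kmeans.fit(selected)
```

Variance filtering is just the cheapest unsupervised baseline; a domain-informed selection of which features actually distinguish your observations would serve the distance metric far better.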