Solved – K-Means clustering puts over 50% of the points in one cluster. How to optimize it?

clustering, hierarchical-clustering, k-means, spark-mllib

I am running a clustering algorithm in Spark and have to choose between
K-Means and Bisecting K-Means. However, the only thing that differs between the two is the runtime, because the performance is equally bad. I have a dataset of roughly 1.3 million entries, all appropriately vectorized. When I run the algorithm with 150–200 clusters, the final output always contains at least one cluster with over 400k entries, while the remaining points are spread across the other clusters. That huge cluster is the big issue. Is there any way for me to optimize the clustering (I have little control over the Spark algorithm workflow)? Some way to force that big cluster to divide? Unfortunately I am also capped on how much memory the calculation can take: going above 250 clusters is risky because I may hit a Java heap space error. Any ideas or suggestions on how to handle this situation?
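One way to think about "forcing that big cluster to divide" is to re-run k-means on just the points of the oversized cluster, which is essentially what bisecting k-means does one split at a time. The sketch below illustrates the idea with a minimal Lloyd's-iteration k-means in numpy on toy data; it is not Spark's implementation, and the toy data and function names are my own for illustration.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means for illustration (not Spark's implementation)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # recompute centers (keep the old center if a cluster goes empty)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels, centers

# toy data: one dense blob plus a smaller one
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (900, 2)), rng.normal(8, 1, (100, 2))])

labels, _ = kmeans(X, k=2)
sizes = np.bincount(labels)
big = sizes.argmax()

# manually "bisect" the oversized cluster by re-clustering only its points
sub_labels, _ = kmeans(X[labels == big], k=2)
```

In Spark you could do the analogous thing by filtering the DataFrame to the rows assigned to the dominant cluster and fitting a second model on that subset, which keeps the memory footprint of each fit small.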

Best Answer

This is very typical behavior of k-means when applied to non-continuous data. It's not what k-means is designed for; you are essentially operating it outside its specifications. Also, k-means is very sensitive to noise. You probably have a lot of one-element clusters, too?
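Checking for that pattern is cheap once you have the predicted labels. A small sketch, assuming you have collected the cluster assignments into a local array (the `labels` array here is synthetic, just to mimic one dominant cluster plus noise):

```python
from collections import Counter
import numpy as np

# hypothetical label assignment, e.g. collected from the model's predictions:
# one huge cluster (id 0) plus points spread thinly over many other clusters
rng = np.random.default_rng(0)
labels = np.concatenate([np.zeros(400, dtype=int),
                         rng.integers(1, 150, size=200)])

sizes = Counter(labels.tolist())
singletons = [c for c, n in sizes.items() if n == 1]
largest_share = max(sizes.values()) / len(labels)

print(f"{len(singletons)} singleton clusters; "
      f"largest cluster holds {largest_share:.0%} of points")
```

Many singleton clusters alongside one dominant cluster is the classic signature of k-means spending its centers on noise points.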

Also, Spark is one of the worst tools for clustering. Consider getting the C code from BaylorML / Greg Hamerly; you will be surprised by how much faster it is. People always assume Spark would be the fastest, but the only thing it actually outperforms is Hadoop MapReduce. Depending on your sparsity, 1.3 million points should still fit into the main memory of a single machine. Tools such as BaylorML and ELKI will then shine and be a lot faster than Spark.

But that doesn't really help you with your clustering problem, because it most likely is a data problem.

I suggest you:

A) Visualize your data and the clustering results. PCA is more appropriate than t-SNE here because it preserves distances better, so you actually see the outliers!

B) Start with a sample rather than all 1.3 million points at once, and only scale up once you have a working approach.

And you may need to use clustering algorithms other than k-means...
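Both suggestions can be combined in a few lines: draw a random sample, then run PCA on it via the SVD and project to 2-D for plotting. A minimal numpy sketch, with synthetic data standing in for the vectorized dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
X_full = rng.normal(size=(50_000, 20))   # stand-in for the vectorized data
X_full[:10] += 40                        # plant a few extreme outliers

# B) work on a random sample first
sample = X_full[rng.choice(len(X_full), size=5_000, replace=False)]

# A) PCA via SVD: center, decompose, project onto the top 2 components
Xc = sample - sample.mean(0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:2].T                     # 2-D coordinates, ready to scatter-plot

# distances from the origin in PCA space make outliers easy to spot
r = np.linalg.norm(proj, axis=1)
```

Because PCA is a linear, distance-preserving projection, points that sit far from everything else in the original space stay far away in the plot, which is exactly the noise that drags k-means centers around.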
