Solved – Clustering procedure where each cluster has an equal number of points

clusteringk-meansmachine learningunsupervised learning

I have some points $X=\{x_1,…,x_n\}$ in $R^p$, and I want to cluster the points so that:

  1. Each cluster contains an equal number of elements of $X$. (Assume that the number of clusters divides $n$.)

  2. Each cluster is "spatially cohesive" in some sense, like the clusters from $k$-means.

It's easy to think of a lot of clustering procedures that satisfy one or the other of these, but does anyone know of a way to get both at once?

Best Answer

I suggest a two-step approach:

  1. get a good initial estimates of the cluster centers, e.g. using hard or fuzzy K-means.

  2. Use Global Nearest Neighbor assignment to associate points with cluster centers: Calculate a distance matrix between each point and each cluster center (you can make the problem a bit smaller by only calculating reasonable distances), replicate each cluster center X times, and solve the linear assignment problem. You'll get, for each cluster center, exactly X matches to data points, so that, globally, the distance between data points and cluster centers is minimized.

Note that you can update cluster centers after step 2 and repeat step 2 to basically run K-means with fixed number of points per cluster. Still, it will be a good idea to get a good initial guess first.

Related Question