Solved – Use hierarchical clustering in R to cluster items into fixed size clusters

machine learning, r

I am trying to use R to do k-means clustering and, like most people, I ran into the challenge of determining when to stop. I have 10,000 items, and potentially 10 times that down the road. My goal is to create a series of clusters that each have a minimum size (e.g. 50 items per cluster) OR contain reasonably similar items. In other words, I don't want any of my output clusters to be too small (even if the items are quite different from each other), but I also don't mind if the clusters are big as long as the items are similar enough.

I imagine I can use some kind of divisive hierarchical approach. I can start by building a small number of clusters and then examine each cluster to determine whether it needs to be split further. I can keep doing this until all clusters meet my stopping criteria.
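A minimal sketch of that divisive idea, using repeated 2-means splits from base R. The helper name `bisect`, the toy data, and the `max_spread` threshold (mean squared distance to the centroid) are assumptions made for illustration; any other "similar enough" measure could be substituted.

```r
# Toy data: 300 items with 2 numeric features (an assumption for illustration).
set.seed(1)
x <- rbind(matrix(rnorm(300, mean = 0), ncol = 2),
           matrix(rnorm(300, mean = 6), ncol = 2))

# Hypothetical helper: recursively bisect a cluster with 2-means.
#   idx        row indices of the cluster under consideration
#   min_size   smallest cluster we are willing to produce
#   max_spread mean squared distance to the centroid below which we stop splitting
bisect <- function(x, idx = seq_len(nrow(x)), min_size = 50, max_spread = 1) {
  sub <- x[idx, , drop = FALSE]
  spread <- mean(rowSums(scale(sub, scale = FALSE)^2))

  # Stop if the cluster cannot yield two halves of min_size, or is already tight.
  if (length(idx) < 2 * min_size || spread <= max_spread) return(list(idx))

  halves <- kmeans(sub, centers = 2, nstart = 5)$cluster
  left  <- idx[halves == 1]
  right <- idx[halves == 2]

  # Reject the split if it would create a cluster that is too small.
  if (min(length(left), length(right)) < min_size) return(list(idx))

  c(bisect(x, left, min_size, max_spread),
    bisect(x, right, min_size, max_spread))
}

clusters <- bisect(x)   # list of row-index vectors, one per final cluster
lengths(clusters)
```

Each element of `clusters` holds the row indices of one final cluster; a split is only accepted when both halves keep at least `min_size` items.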

Does anyone know of good resources on how other people do this?

Best Answer

Hierarchical clustering is a whole family of methods that should suit your needs: it builds a tree in which each higher level represents bigger (more general) clusters. Analyzing this structure and cutting it with a custom rule will bring you to the solution you describe.
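As a rough illustration, the tree can be built with base R's `hclust` and cut at successively finer levels with `cutree`, keeping the finest cut in which every cluster still has the minimum size. The toy data, the Ward linkage, and the threshold of 50 are assumptions for the sketch. Note that `dist` stores all pairwise distances, so for 10,000 items it needs roughly 50 million values (about 400 MB as doubles).

```r
# Toy data: three groups of 100 items with 2 numeric features (illustrative only).
set.seed(42)
x <- rbind(matrix(rnorm(200, mean = 0),  ncol = 2),
           matrix(rnorm(200, mean = 5),  ncol = 2),
           matrix(rnorm(200, mean = 10), ncol = 2))

min_size <- 50  # smallest acceptable cluster

# Agglomerative clustering on Euclidean distances with Ward linkage.
tree <- hclust(dist(x), method = "ward.D2")

# Try increasingly fine cuts and remember the finest one
# whose smallest cluster still has at least min_size items.
max_k  <- floor(nrow(x) / min_size)
best_k <- 1
for (k in seq_len(max_k)) {
  sizes <- table(cutree(tree, k = k))
  if (min(sizes) >= min_size) best_k <- k
}

labels <- cutree(tree, k = best_k)
table(labels)  # cluster sizes at the chosen level
```

This cuts the whole tree at a single level; a per-branch rule (splitting only the branches that are still too heterogeneous) would need a walk over `tree$merge` instead.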

In R, you can check out the CRAN task view on cluster analysis, http://cran.r-project.org/web/views/Cluster.html, where you will find several hierarchical clustering implementations.

The easiest approaches would be to either:

  • run hierarchical clustering (any variant), analyze the tree, and select the level of cluster generality that fits your constraints, or
  • cluster with any existing method, and then prune the small clusters (remove them iteratively and assign each of their points to the nearest of the remaining clusters).
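The pruning variant could look roughly like the following. It is a sketch under assumptions: k-means is used as the "existing method", the starting number of clusters (12) is arbitrary, and "nearest cluster" is taken to mean nearest centroid.

```r
set.seed(1)
x <- matrix(rnorm(600), ncol = 2)  # 300 toy items, 2 features (illustrative only)
min_size <- 50

# Deliberately over-cluster, then prune.
labels <- kmeans(x, centers = 12, nstart = 5, iter.max = 50)$cluster

repeat {
  sizes <- table(labels)
  if (min(sizes) >= min_size || length(sizes) == 1) break

  # Centroids of the current clusters (rows named by cluster label).
  centers <- rowsum(x, labels) / as.vector(sizes)

  # Dissolve the smallest cluster ...
  smallest <- names(sizes)[which.min(sizes)]
  keep     <- setdiff(names(sizes), smallest)
  orphans  <- which(labels == as.integer(smallest))

  # ... and send each of its points to the nearest remaining centroid.
  d <- sapply(keep, function(k) {
    rowSums(sweep(x[orphans, , drop = FALSE], 2, centers[k, ])^2)
  })
  d <- matrix(d, nrow = length(orphans))  # keep matrix shape for single rows/columns
  labels[orphans] <- as.integer(keep[apply(d, 1, which.min)])
}

table(labels)  # every remaining cluster has at least min_size items
```

Centroids are recomputed on every pass so that reassigned points influence later assignments; reusing the original k-means centres would be a simpler but cruder alternative.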