Solved – Use hierarchical clustering in R to cluster items into fixed size clusters

machine learning, r

I am trying to use R to do k-means clustering and, like most people, I ran into the challenge of deciding when to stop. I have 10,000 items, and potentially 10 times that down the road. My goal is to create a series of clusters that each have a minimum size (e.g. 50 items per cluster) OR contain reasonably similar items. In other words, I don't want any of my output clusters to be too small (even if the items in them are quite different from each other), but I also don't mind if a cluster is large as long as its items are similar enough.

I imagine I can use some kind of divisive hierarchical approach. I can start by building a small number of clusters and examine each cluster to determine if it needs to be split into more clusters. I can keep doing this till all clusters meet my stopping criteria.

Does anyone know of good resources on how other people approach this?

Best Answer

There is a whole family of hierarchical clustering methods that should suit your needs. They build a tree in which each higher level represents bigger (more general) clusters. Analyzing this structure and cutting it with a custom rule will get you to the solution you describe.

In R, check out the CRAN Cluster task view at http://cran.r-project.org/web/views/Cluster.html, where you will find several hierarchical clustering implementations.
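
As a rough illustration of what "custom cutting" could look like, here is a minimal sketch using `hclust` from base R's stats package. The toy data, the `min_size` and `max_height` values, and the helper names `members` and `split_node` are all assumptions made for the example. It walks the tree from the root and refuses any split that would create a cluster smaller than `min_size`, or that would divide a group already merged below `max_height`.

```r
# Sketch only: data, thresholds and helper names are illustrative assumptions.
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0),  ncol = 2),   # 100 points
           matrix(rnorm(200, mean = 5),  ncol = 2),   # 100 points
           matrix(rnorm(40,  mean = 10), ncol = 2))   #  20 points
min_size   <- 30   # assumed minimum acceptable cluster size
max_height <- 5    # assumed: groups merged below this height are "similar enough"

hc <- hclust(dist(x), method = "ward.D2")

# Observations under a node of hc$merge:
# negative entries are single observations, positive entries are earlier merges.
members <- function(node) {
  if (node < 0) return(-node)
  c(members(hc$merge[node, 1]), members(hc$merge[node, 2]))
}

# Split a node only if both children are large enough and the node is not
# already tight; otherwise keep the node as one cluster.
split_node <- function(node) {
  if (node < 0) return(list(-node))
  left  <- members(hc$merge[node, 1])
  right <- members(hc$merge[node, 2])
  too_small <- min(length(left), length(right)) < min_size
  similar   <- hc$height[node] <= max_height
  if (too_small || similar) return(list(c(left, right)))
  c(split_node(hc$merge[node, 1]), split_node(hc$merge[node, 2]))
}

clusters <- split_node(nrow(hc$merge))   # the last merge is the root
labels <- integer(nrow(x))
for (i in seq_along(clusters)) labels[clusters[[i]]] <- i
table(labels)
```

One limitation of this sketch: a node is kept whole whenever either child is too small, so a handful of outliers that split off near the root can block further splitting. Note also that `dist` needs memory quadratic in the number of items, which may matter at 10,000 items and beyond.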

The easiest approaches would be to either:

run any hierarchical clustering, analyze the tree, and cut it at the level of generality that fits your constraints, or
cluster with any existing method and then prune the small clusters (remove them one at a time and assign each point to the nearest of the remaining clusters).
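
The pruning idea can be sketched as follows, assuming k-means as the "existing method" and Euclidean distance to the cluster centres as the notion of "nearest". The toy data, the starting value of 6 centres, `min_size`, and the helper `sq_dist` are assumptions for illustration, not a recommended configuration.

```r
# Sketch only: data, parameters and helper names are illustrative assumptions.
set.seed(42)
x <- rbind(matrix(rnorm(400, mean = 0),  ncol = 2),   # 200 points
           matrix(rnorm(400, mean = 6),  ncol = 2),   # 200 points
           matrix(rnorm(20,  mean = 12), ncol = 2))   #  10 points
min_size <- 50   # assumed minimum acceptable cluster size

km      <- kmeans(x, centers = 6, nstart = 10)
centers <- km$centers
labels  <- km$cluster

# Squared Euclidean distance from every row of x to every row of ctr
sq_dist <- function(x, ctr) {
  sapply(seq_len(nrow(ctr)), function(j) rowSums(sweep(x, 2, ctr[j, ])^2))
}

repeat {
  sizes <- tabulate(labels, nbins = nrow(centers))
  if (min(sizes) >= min_size || nrow(centers) == 1) break
  # Drop the smallest cluster ...
  centers <- centers[-which.min(sizes), , drop = FALSE]
  # ... and assign every point to the nearest remaining centre.
  # The centres themselves are not re-estimated in this sketch.
  labels <- max.col(-sq_dist(x, centers), ties.method = "first")
}

table(labels)
```

The loop removes one centre per pass, so it always terminates, at worst with a single cluster. Re-running `kmeans` with the surviving centres as starting points would be a natural refinement, but it could produce small clusters again and would need the same check.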
