Solved – Choosing the number of clusters in hierarchical agglomerative clustering

clustering, dbscan, hac, hierarchical-clustering

I have a set of points that I want to cluster into groups according to a number of computed features.

I have a distance matrix containing the distances between all pairs of points. I tried K-Means and DBSCAN first, but I have no clue about the number of clusters underlying the data (required for K-Means) or the optimal Epsilon and MinPts (required for DBSCAN), so I decided to turn to Hierarchical Agglomerative Clustering, since it doesn't require setting any parameters up front.

I used R to perform HAC. Now, to turn the resulting dendrogram into groups of points (flat clusters), I need to choose the level at which to cut the tree (equivalently, the number of clusters).

I managed to do this for different numbers of clusters and evaluated the silhouette value for each resulting clustering, so I can choose the number of clusters that gives the maximum silhouette value as the optimal one.
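A minimal R sketch of this procedure (assuming the precomputed distance matrix is available as a dist object `d`; the variable names here are mine, not from my actual code):

```r
library(cluster)   # for silhouette()

# d <- as.dist(distance_matrix)              # precomputed pairwise distances
hc <- hclust(d, method = "average")          # group-average linkage

ks <- 2:10
avg_sil <- sapply(ks, function(k) {
  labels <- cutree(hc, k = k)                # cut the dendrogram into k flat clusters
  mean(silhouette(labels, d)[, "sil_width"]) # average silhouette width for this k
})

ks[which.max(avg_sil)]                       # k with the maximum average silhouette
```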

What happens is that the maximum silhouette value (0.8) is obtained for 5 clusters, but the cluster sizes are not very good (one cluster has > 900 points, the second has 5 points, and the other three are single points).
I was using group average (average linkage) for the HAC cluster-to-cluster distance.

I also tried Ward's method (which is supposed to give clusters of more equal sizes). Although it gives a somewhat better distribution of points across the clusters, the silhouette value of the resulting clustering is really low (0.1).
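The size imbalance and the linkage comparison can be checked directly; a short sketch along the same lines, again assuming `d` is the dist object from above:

```r
# Compare cluster sizes and silhouette for average vs. Ward linkage at k = 5.
# Note: hclust's "ward.D2" expects plain distances; "ward.D" expects squared ones.
for (m in c("average", "ward.D2")) {
  labels <- cutree(hclust(d, method = m), k = 5)
  cat("\n", m, "\n")
  print(table(labels))                       # cluster size distribution
  cat("average silhouette:",
      mean(silhouette(labels, d)[, "sil_width"]), "\n")
}
```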

Can you suggest any good clustering methods that don't require parameters, or ways to obtain optimal values for the clustering parameters (the dendrogram cutting level for HAC, or Epsilon and MinPts for DBSCAN)?

Best Answer

See, even hierarchical clustering needs parameters if you want to get a partitioning out. In fact, hierarchical clustering has (roughly) four parameters:

1. the actual algorithm (divisive vs. agglomerative),
2. the distance function,
3. the linkage criterion (single-link, Ward, etc.), and
4. the distance threshold at which you cut the tree (or any other extraction method).
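To make that concrete, here is a hedged R sketch (standard base-R and cluster-package calls, not code from the question) showing where each of those four choices enters; `X` is a hypothetical feature matrix and the cut height 3.5 is made up:

```r
library(cluster)                      # diana() for the divisive variant

d  <- dist(X, method = "euclidean")   # 2. distance function (or use a precomputed dist)
hc <- hclust(d, method = "single")    # 1. agglomerative algorithm, 3. linkage criterion
# dv <- diana(d)                      # 1. (alternative) divisive algorithm

labels <- cutree(hc, h = 3.5)         # 4. distance threshold at which the tree is cut
# labels <- cutree(hc, k = 4)         #    ...or another extraction method, e.g. a fixed k
table(labels)
```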

The fact is that there is no good "push button" solution to cluster analysis. It is an exploratory technique, meaning that you have to try different methods and parameters and analyze the results.

I have found DBSCAN to be very usable in most cases. Yes, it has two parameters (the distance threshold, a.k.a. the neighborhood predicate, and minPts, a.k.a. the core-point predicate). I'm not counting the distance function separately this time, because what is really needed is an "is neighbor of" binary predicate; see GDBSCAN.

The reason is that in many applications you can choose these values intuitively if you understand your data well enough. For example, when working with geo data, the distance is literally in kilometers, which lets me intuitively specify the spatial resolution. Similarly, minPts gives me intuitive control over how "significant" a subset of observations needs to be before it becomes a cluster.

Usually, when you find DBSCAN hard to use, it is because you have not yet understood "distance" on your data. You first need to figure out how to measure distance and what the resulting numbers mean to you; then you will know what threshold to use.
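One common heuristic in that spirit (from the original DBSCAN paper, not spelled out in this answer) is to plot the sorted k-nearest-neighbor distances and put the threshold near the "knee" of the curve. A sketch with the `dbscan` R package, assuming it accepts the dist object `d` from the question, and with an entirely made-up eps of 0.4:

```r
library(dbscan)

minPts <- 5                       # how "significant" a group must be to form a cluster
kNNdistplot(d, k = minPts - 1)    # sorted distance of each point to its (minPts-1)-th neighbor
abline(h = 0.4, lty = 2)          # eps candidate: roughly where the curve has its "knee"

res <- dbscan(d, eps = 0.4, minPts = minPts)
table(res$cluster)                # cluster 0 contains the noise points
```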

And in the end, go and try stuff out. It's data exploration, not "return(truth);". There is no "true" clustering. There are only "obvious", "useless", and "interesting" clusterings, and these qualities cannot be measured mathematically; they are subjective to the user.