Solved – How to understand the drawbacks of Hierarchical Clustering

clustering, hierarchical clustering, k-means, unsupervised learning

Can someone explain the pros and cons of Hierarchical Clustering?

  1. Does Hierarchical Clustering have the same drawbacks as K-means?
  2. What are the advantages of Hierarchical Clustering over K-means?
  3. When should we use K-means over Hierarchical Clustering, and vice versa?

The answers to this post explain the drawbacks of k-means very well:
How to understand the drawbacks of K-means

Best Answer

Whereas $k$-means tries to optimize a global goal (the variance of the clusters) and only reaches a local optimum, agglomerative hierarchical clustering aims at finding the best step at each cluster fusion (a greedy algorithm): each merge is carried out exactly, but the resulting overall solution may be suboptimal.
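For concreteness, here is a minimal Python sketch (assuming scikit-learn and a made-up blob dataset, so the numbers are purely illustrative) contrasting the two behaviours: $k$-means can end up in different local optima depending on initialization, whereas the greedy agglomerative merges are deterministic.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Toy data, purely illustrative.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# k-means optimizes a global objective (within-cluster variance), but a single
# run may converge to different local optima depending on the initialization.
for seed in range(3):
    km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
    print(f"k-means seed={seed}: within-cluster variance (inertia) = {km.inertia_:.1f}")

# Agglomerative clustering is greedy and deterministic: each merge is the
# locally best one, but the final partition need not optimize any global goal.
agg = AgglomerativeClustering(n_clusters=4, linkage="ward").fit(X)
print("agglomerative cluster sizes:", np.bincount(agg.labels_))
```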

One should use hierarchical clustering when the underlying data have a hierarchical structure (like the correlations in financial markets) and you want to recover that hierarchy. You can still apply $k$-means for this, but you may end up with partitions (from the coarsest one, all data points in one cluster, to the finest one, each data point being its own cluster) that are not nested in each other and thus do not form a proper hierarchy.
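To see the nesting point concretely, here is a sketch on made-up data (the helper `is_nested` is mine, not a standard function): the $k$-means partitions for $k=2$ and $k=3$ are fitted independently and need not be nested, whereas two cuts of the same dendrogram always are.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import linkage, fcluster

def is_nested(fine, coarse):
    """True if every cluster of the finer partition lies inside a single coarse cluster."""
    return all(len(set(coarse[fine == c])) == 1 for c in np.unique(fine))

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=2.0, random_state=1)

# k-means partitions for k=2 and k=3 are computed independently, so the finer
# one need not refine the coarser one.
km2 = KMeans(n_clusters=2, random_state=0).fit_predict(X)
km3 = KMeans(n_clusters=3, random_state=0).fit_predict(X)
print("k-means partitions nested:", is_nested(km3, km2))

# Cutting the same dendrogram at two levels always yields nested partitions,
# i.e. a proper hierarchy.
Z = linkage(X, method="average")
h2 = fcluster(Z, t=2, criterion="maxclust")
h3 = fcluster(Z, t=3, criterion="maxclust")
print("hierarchical partitions nested:", is_nested(h3, h2))
```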

If you want to dig into finer properties of clustering, you may not want to oppose flat clustering such as $k$-means to hierarchical clustering such as Single, Average, or Complete Linkage. For instance, all of these clusterings are space-conserving, i.e. when building clusters you do not distort the space, whereas a hierarchical clustering such as Ward is not space-conserving, i.e. at each merging step it distorts the metric space.
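One way to see this on made-up data (a sketch using SciPy's `linkage`): the merge heights of Single, Average and Complete Linkage stay within the range of the original pairwise distances, whereas Ward's merge heights can grow far beyond it, which is the distortion referred to above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs

# Toy data: two tight, well-separated blobs (purely illustrative).
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.3, random_state=0)
max_pairwise = pdist(X).max()

# Space-conserving linkages keep every merge height within the range of the
# original pairwise distances; Ward's merge heights are not bounded by it.
for method in ["single", "average", "complete", "ward"]:
    Z = linkage(X, method=method)
    print(f"{method:>8}: max merge height = {Z[:, 2].max():.2f} "
          f"(max pairwise distance = {max_pairwise:.2f})")
```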

To conclude, the drawbacks of the hierarchical clustering algorithms can be very different from one another. Some share properties with $k$-means: Ward aims at optimizing variance, but Single Linkage does not. They can also differ among themselves: Ward is space-dilating, whereas Single Linkage is space-conserving like $k$-means.

-- edit to make the space-conserving and space-dilating properties precise

Space-conserving: $$D_{ij} \in \left[ \min_{x \in C_i, y \in C_j} d(x,y), \max_{x \in C_i, y \in C_j} d(x,y) \right]$$ where $D_{ij}$ is the distance between clusters $C_i$ and $C_j$ you want to merge, and $d$ is the distance between datapoints.

Space-dilating: $$D(C_i \cup C_j, C_k) \geq \max(D_{ik}, D_{jk}),$$ i.e. by merging $C_i$ and $C_j$ the algorithm will push further away the cluster $C_k$.
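A small numeric check of both definitions on made-up clusters (a sketch; the helper `ward_dist` implements one common convention for the Ward cluster distance, $\sqrt{2|C_i||C_j|/(|C_i|+|C_j|)}\,\lVert \bar{c}_i - \bar{c}_j \rVert$, which reduces to the Euclidean distance for singletons):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# Made-up clusters: C_i and C_j close together, C_k far away.
C_i = rng.normal(loc=[0, 0], scale=0.2, size=(20, 2))
C_j = rng.normal(loc=[1, 0], scale=0.2, size=(20, 2))
C_k = rng.normal(loc=[10, 0], scale=0.2, size=(20, 2))

def ward_dist(A, B):
    """Ward cluster distance: sqrt(2|A||B|/(|A|+|B|)) * ||mean(A) - mean(B)||."""
    nA, nB = len(A), len(B)
    return np.sqrt(2 * nA * nB / (nA + nB)) * np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

# Space-conserving check: Single/Complete/Average distances between C_i and C_j
# lie inside the interval of pointwise distances; Ward's falls outside it here.
d_ij = cdist(C_i, C_j)
print("interval [min, max]:", d_ij.min(), d_ij.max())
print("single:", d_ij.min(), " complete:", d_ij.max(), " average:", d_ij.mean())
print("ward:", ward_dist(C_i, C_j))

# Space-dilating check for Ward: after merging C_i and C_j, the distance to
# C_k is at least max(D_ik, D_jk) in this configuration.
D_ik, D_jk = ward_dist(C_i, C_k), ward_dist(C_j, C_k)
D_merged = ward_dist(np.vstack([C_i, C_j]), C_k)
print("max(D_ik, D_jk) =", max(D_ik, D_jk), "  D(Ci u Cj, Ck) =", D_merged)
```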
