Solved – agglomerative clustering sensitivity to outliers: single-link vs complete-link

clusteringhierarchical clusteringoutliers

Agglomerative clustering can use various measures to calculate distance between two clusters, which is then used to decide which two clusters to merge.

Two popular approaches are single-link and complete-link. There seems to be some discrepancy in whether single-link or complete-link is sensitive to outliers. I am stating a few examples below but I am sure that there are many more.

(1) Stanford NLP IR book states "[complete-link] causes sensitivity to outliers".

(2) Chapter 8 in the Data Mining book by Tan and Kumar says the following in the section 8.3.2:

"The single link technique is good at handling non-elliptical shapes, but is sensitive to noise and outliers."

"Complete link is less susceptible to noise and outliers, but it can break large clusters and it favors globular shapes."

(3) The Applied Multivariate Statistics book says (page 533):

"… single link method which is also very sensitive to errors of measurement, but somewhat robust to outliers. The complete link and Ward’s method tend to find compact clusters of nearly equal size with the clustering solution adversely affected by outliers."

To me it intuitive sense that complete-link is more sensitive to outliers a it uses max over the distances between the points in two given clusters which is a non local measure.

So the question is which one is the correct view? Or maybe both are correct and are saying different things that I do not understand?

Best Answer

The complete link will merge outliers late. Because they increase maximum distances much. It will usually prefer inliers first, which is good. So I'd have assumed it to be more robust than single-link.

But without a good definition of "outlier" and "robust" it is not a well defined claim.

Related Question