Only single-linkage is optimal. Complete-linkage fails, so your intuition was wrong there. ;-)
As a simple example, consider the one-dimensional data set 1,2,3,4,5,6.
The (unique) optimum solution for complete linkage with two clusters is (1,2,3) and (4,5,6), at linkage height 2. The (unique) optimum solution with three clusters is (1,2), (3,4), (5,6), at linkage height 1.
There obviously exists no agglomerative hierarchy that contains both, because these partitions do not nest. Hence, no hierarchical clustering contains all optimal solutions for complete linkage. Based on that example, I would even expect complete linkage to exhibit one of the worst gaps between the optimum solution and the agglomerative solution at high levels... And this example is likely a counterexample for most other schemes, too (except single linkage, where this data set degenerates).
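The counterexample is easy to check by hand, but here is a minimal pure-Python sketch (the helper names `diameter` and `height` are mine, not from any library) that verifies the heights of the two partitions and that they fail to nest:

```python
def diameter(cluster):
    """Complete-linkage cost of one cluster: its largest pairwise distance."""
    return max(abs(a - b) for a in cluster for b in cluster)

def height(partition):
    """Complete-linkage height of a partition: the worst cluster diameter."""
    return max(diameter(c) for c in partition)

two = [{1, 2, 3}, {4, 5, 6}]      # optimal 2-cluster solution
three = [{1, 2}, {3, 4}, {5, 6}]  # optimal 3-cluster solution

print(height(two))    # 2
print(height(three))  # 1

# The partitions do not nest: {3, 4} straddles both clusters of `two`,
# so no single hierarchy (dendrogram) can contain both optima.
nested = all(any(c <= d for d in two) for c in three)
print(nested)  # False
```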
You may be interested in studying this article:
Gunnar E. Carlsson, Facundo Mémoli:
Characterization, Stability and Convergence of Hierarchical Clustering Methods. Journal of Machine Learning Research 11: 1425-1470 (2010)
and the references contained therein, in particular also:
Kleinberg, Jon M. "An impossibility theorem for clustering." Advances in neural information processing systems. 2003.
These seem to indicate that given some mild stability requirements (which can be interpreted as uncertainty of our distance measurements), only single-linkage clustering remains. Which, unfortunately, is all but unusable in most cases... so IMHO, for practical purposes, the authors overshoot a bit, favoring pretty theory over the usefulness of the result.
From the Conclusion of Murtagh, F. & Legendre, P. (2011), Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm, arXiv:1111.6285v2:
Two algorithms, Ward1 and Ward2...When applied to the same distance matrix D, they produce different results. This article has shown that when they are applied to the same dissimilarity matrix D, only Ward2 minimizes the Ward clustering criterion and produces the Ward method. The Ward1 and Ward2 algorithms can be made to optimize the same criterion and produce the same clustering topology by using Ward1 with D-squared and Ward2 with D.
For example, hclust(dist(x)^2, method="ward.D") is equivalent to hclust(dist(x), method="ward.D2").
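The Ward1/Ward2 relationship can be sketched in pure Python (a toy Lance-Williams implementation with hypothetical function names, not R's actual hclust code). In this sketch, `agglomerate_ward` plays the role of Ward1, applying the Ward update to the input dissimilarities as given, while `ward_d2` squares the input first and reports square-rooted heights, mirroring the paper's description of Ward2. Running Ward1 on D² then reproduces Ward2's merge order on D:

```python
import math
from itertools import combinations

def agglomerate_ward(diss, n):
    """Ward1 ("ward.D"): Lance-Williams Ward update applied directly to
    the input dissimilarities. diss maps frozenset({i, j}) -> value."""
    d = dict(diss)
    size = {i: 1 for i in range(n)}
    active = set(range(n))
    merges = []
    while len(active) > 1:
        # Merge the closest pair under the current dissimilarities.
        i, j = min(combinations(sorted(active), 2),
                   key=lambda p: d[frozenset(p)])
        h = d[frozenset((i, j))]
        new = max(size) + 1  # fresh cluster id
        for k in active - {i, j}:
            ni, nj, nk = size[i], size[j], size[k]
            # Lance-Williams update for the Ward criterion.
            d[frozenset((new, k))] = (
                (ni + nk) * d[frozenset((i, k))]
                + (nj + nk) * d[frozenset((j, k))]
                - nk * h
            ) / (ni + nj + nk)
        size[new] = size[i] + size[j]
        active -= {i, j}
        active.add(new)
        merges.append((frozenset((i, j)), h))
    return merges

def ward_d2(diss, n):
    """Ward2 ("ward.D2"): square the input, run the same update,
    report square-rooted heights."""
    sq = {k: v * v for k, v in diss.items()}
    return [(m, math.sqrt(h)) for m, h in agglomerate_ward(sq, n)]

# 1-D toy data with Euclidean distances.
pts = [0.0, 1.0, 3.0, 7.0]
D = {frozenset((i, j)): abs(pts[i] - pts[j])
     for i, j in combinations(range(len(pts)), 2)}

ward1_on_Dsq = agglomerate_ward({k: v * v for k, v in D.items()}, len(pts))
ward2_on_D = ward_d2(D, len(pts))

# Same merge sequence; heights related by a square root.
assert [m for m, _ in ward1_on_Dsq] == [m for m, _ in ward2_on_D]
assert all(abs(math.sqrt(h1) - h2) < 1e-9
           for (_, h1), (_, h2) in zip(ward1_on_Dsq, ward2_on_D))
```

Note that in this sketch the equivalence holds by construction, which is exactly Murtagh & Legendre's point: feeding Ward1 squared dissimilarities makes it compute the same criterion Ward2 computes on the raw ones.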
It should be noted that k-means is an insufficient method: it was developed for adding machines and is now outdated. I justified this in detail in my article "Reclassification formula that provides to surpass K-means method" (2012) on arXiv. Ward's clustering combined with an enhanced k-means (the so-called "k-meanless" method) provides accurate, optimal clustering with the lowest possible approximation error for each cluster number. This is true at least for the digital images I work with. M. Kharinov