Only single-linkage is optimal. Complete-linkage fails, so your intuition was wrong there. ;-)
As a simple example, consider the one-dimensional data set 1,2,3,4,5,6.
The (unique) optimum solution for complete linkage with two clusters is (1,2,3) and (4,5,6) (complete linkage height 2). The (unique) optimum solution with three clusters is (1,2), (3,4), (5,6), at linkage height 1.
There exists no agglomerative hierarchy that contains both, obviously, because these partitions do not nest. Hence, no hierarchical clustering contains all optimal solutions for complete linkage. I would even expect complete linkage to exhibit among the worst gap between the optimum solution and the agglomerative solution at high levels, based on that example... An this example likely is a counter example for most other schemes (except single linkage, where this data set degenerates).
You may be interested in studying this article:
Gunnar E. Carlsson, Facundo Mémoli:
Characterization, Stability and Convergence of Hierarchical Clustering Methods. Journal of Machine Learning Research 11: 1425-1470 (2010)
and the references contained therein, in particular also:
Kleinberg, Jon M. "An impossibility theorem for clustering." Advances in neural information processing systems. 2003.
These seem to indicate that given some mild stability requirements (which can be interpreted as uncertainty of our distance measurements), only single-linkage clustering remains. Which, unfortunately, is all but usable in most cases... so IMHO for practical purposes the authors overshoot a bit, neglecting usefulness of the result over pretty theory.
See, even hierarchical clustering needs parameters if you want to get a partitioning out. In fact, hierarchical clustering has (roughly) four parameters: 1. the actual algorithm (divisive vs. agglomerative), 2. the distance function, 3. the linkage criterion (single-link, ward, etc.) and 4. the distance threshold at which you cut the tree (or any other extraction method).
Fact is that there doesn't exist any good "push button" solution to cluster analysis. It is an explorative technique, meaning that you have to try different methods and parameters and analyze the result.
I found DBSCAN to be very usable in most cases. Yes, it has two parameters (distance threshold aka: neighbor predicate, and minpts aka core predicate) - I'm not counting the distance function separately this time, because it's really a "is neighbor of" binary predicate that is needed; see GDBSCAN.
The reason is that in many applications you can choose these values intuitively if you have understood your data well enough. E.g. when working with Geo data, distance is literatlly in kilometers, and it allows me to intuitively specify the spatial resolution.
Similarly, minpts gives me an intuitive control over how "significant" a subset of observations needs to be before it becomes a cluster.
Usually, when you find DBSCAN hard to use, it is because you have not understood "distance" on your data yet. You then first need to figure out how to measure distance and what the resulting numbers mean to you. Then you'll know the threshold to use.
And in the end go and try out stuff. It's data exploration, not "return(truth);
". There is not "true" clustering. There are only "obvious", "useless" and "interesting" clusterings, and these qualities cannot be measured mathematically; they are subjective to the user.
Best Answer
This may be one of the cases where there's more art than science to clustering. I would suggest that you let your clustering algorithm run for a short time before letting the C-Index calculations kick in. "Short time" may be after processing a few pairs, just when it starts to exceed 0, or some other heuristic. (After all you don't expect to stop at 1 or 2 clusters, otherwise a different separation algorithm may have been deployed.)
For a book recommendation, I can suggest:
You can scan/search the available contents on google books to see if it might meet your needs. It's worked as a reference for me in the past.