Only single-linkage is optimal. Complete-linkage fails, so your intuition was wrong there. ;-)
As a simple example, consider the one-dimensional data set 1,2,3,4,5,6.
The (unique) optimum solution for complete linkage with two clusters is (1,2,3) and (4,5,6) (complete linkage height 2). The (unique) optimum solution with three clusters is (1,2), (3,4), (5,6), at linkage height 1.
There exists no agglomerative hierarchy that contains both, obviously, because these partitions do not nest. Hence, no hierarchical clustering contains all optimal solutions for complete linkage. Based on this example, I would even expect complete linkage to exhibit one of the worst gaps between the optimum solution and the agglomerative solution at high levels... And this example is likely a counterexample for most other schemes, too (except single linkage, where this data set degenerates).
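If you want to check this numerically, here is a minimal sketch using SciPy (exact tie-breaking of the merge order may vary, but no hierarchy can contain both optimal partitions, since they do not nest):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# The one-dimensional toy data set from above.
X = np.array([[1.], [2.], [3.], [4.], [5.], [6.]])
Z = linkage(pdist(X), method='complete')

# 3 clusters: (1,2) (3,4) (5,6) at height 1 -- happens to be optimal.
print(fcluster(Z, t=3, criterion='maxclust'))
# 2 clusters: e.g. (1,2,3,4) (5,6) at height 3,
# while the optimum (1,2,3) (4,5,6) has height 2.
print(fcluster(Z, t=2, criterion='maxclust'))
```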
You may be interested in studying this article:
Gunnar E. Carlsson, Facundo Mémoli:
Characterization, Stability and Convergence of Hierarchical Clustering Methods. Journal of Machine Learning Research 11: 1425-1470 (2010)
and the references contained therein, in particular also:
Jon M. Kleinberg:
An Impossibility Theorem for Clustering. Advances in Neural Information Processing Systems (2003)
These seem to indicate that given some mild stability requirements (which can be interpreted as uncertainty in our distance measurements), only single-linkage clustering remains. Which, unfortunately, is hardly usable in most cases... so IMHO the authors overshoot a bit for practical purposes, favoring pretty theory over the usefulness of the result.
Consider using a graph-based approach.
Try to find a threshold that defines when users are "somewhat similar". It can be quite low. Build a graph of these somewhat-similar users.
Then use a clique detection approach to find groups in this graph (see the sketch below).
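A minimal sketch with networkx; the random `similarity` matrix and the 0.3 threshold are placeholders you would replace with your actual similarity measure:

```python
import networkx as nx
import numpy as np

# Hypothetical symmetric user-user similarity matrix with values in [0, 1].
rng = np.random.default_rng(0)
similarity = rng.random((20, 20))
similarity = (similarity + similarity.T) / 2

THRESHOLD = 0.3  # "somewhat similar"; can be quite low, tune for your data

# Connect two users iff they are at least somewhat similar.
G = nx.Graph()
n = similarity.shape[0]
G.add_nodes_from(range(n))
for i in range(n):
    for j in range(i + 1, n):
        if similarity[i, j] >= THRESHOLD:
            G.add_edge(i, j)

# Maximal cliques = groups in which *every* pair is somewhat similar.
groups = [c for c in nx.find_cliques(G) if len(c) >= 3]
print(groups)
```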
Ward's method is about minimizing squared deviations, similar to k-means. Squared deviations are related to squared Euclidean distance.
To better understand the relationship, have a look at the underlying math of variance. Variance, by construction, is meant for continuous variables, not for binary data (where the mean is no longer a meaningful data point).
This implies that Ward should not be used with other distances, unless you can prove them to be equivalent to squared Euclidean distance in some kernel space, or perhaps to a Bregman divergence, etc.
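To make the connection to squared deviations concrete, here is a small sketch (function names are my own) verifying the standard identity behind Ward's criterion: the increase in total within-cluster sum of squares when merging clusters A and B equals |A||B| / (|A|+|B|) times the squared Euclidean distance between the cluster means:

```python
import numpy as np

def sse(X):
    """Within-cluster sum of squared deviations from the mean."""
    X = np.asarray(X, dtype=float)
    return ((X - X.mean(axis=0)) ** 2).sum()

def ward_merge_cost(A, B):
    """Increase in total SSE when A and B are merged (Ward's criterion)."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    diff = A.mean(axis=0) - B.mean(axis=0)
    return len(A) * len(B) / (len(A) + len(B)) * np.dot(diff, diff)

rng = np.random.default_rng(0)
A, B = rng.random((5, 3)), rng.random((4, 3))
assert np.isclose(ward_merge_cost(A, B),
                  sse(np.vstack([A, B])) - sse(A) - sse(B))
```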
For a distance (or similarity) on binary variables such as Dice, Hamming, or Jaccard, other linkages such as single linkage and complete linkage are more meaningful and interpretable. More precisely, single linkage means that every point in the cluster is connected to the others by steps of at most this distance, and complete linkage means that every point in a cluster is within at most this distance of every other point in the cluster.
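For illustration, a minimal SciPy sketch clustering binary vectors with Jaccard distance and complete linkage (the random data and the 0.6 cut height are placeholders):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((8, 10)) > 0.5      # hypothetical binary feature matrix

D = pdist(X, metric='jaccard')     # pairwise Jaccard distances
Z = linkage(D, method='complete')  # fine: complete linkage only needs distances
labels = fcluster(Z, t=0.6, criterion='distance')
# Interpretation: within each cluster, every pair of members has
# Jaccard distance at most 0.6 (the cluster diameter).
print(labels)
```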