Only single linkage is optimal. Complete linkage fails, so your intuition was wrong there. ;-)
As a simple example, consider the one-dimensional data set 1,2,3,4,5,6.
The (unique) optimal solution for complete linkage with two clusters is (1,2,3) and (4,5,6), at complete-linkage height 2. The (unique) optimal solution with three clusters is (1,2), (3,4), (5,6), at height 1.
Obviously, no agglomerative hierarchy can contain both, because these partitions do not nest: the cluster (3,4) straddles the boundary of the two-cluster solution. Hence, no hierarchical clustering contains all optimal solutions for complete linkage. Based on this example, I would even expect complete linkage to exhibit one of the worst gaps between the optimal solution and the agglomerative solution at high levels of the hierarchy. And this example is likely a counterexample for most other schemes, too (except single linkage, where this data set degenerates).
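You can check this example yourself; here is a minimal R sketch using `hclust` (note that the exact merge order depends on how the ties at height 1 are broken, so the cuts you get may differ slightly):

```
x <- c(1, 2, 3, 4, 5, 6)
hc <- hclust(dist(x), method = "complete")

# An agglomerative hierarchy can only contain nested partitions,
# so it cannot hold both optima at once:
cutree(hc, k = 3)  # happens to recover (1,2),(3,4),(5,6) here
cutree(hc, k = 2)  # e.g. (1,2,3,4) vs. (5,6) -- worse than (1,2,3),(4,5,6)
```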
You may be interested in studying this article:
Gunnar E. Carlsson, Facundo Mémoli:
Characterization, Stability and Convergence of Hierarchical Clustering Methods. Journal of Machine Learning Research 11: 1425-1470 (2010)
and the references contained therein, in particular also:
Jon M. Kleinberg: "An Impossibility Theorem for Clustering." Advances in Neural Information Processing Systems (2003)
These seem to indicate that under some mild stability requirements (which can be interpreted as uncertainty in our distance measurements), only single-linkage clustering remains. Unfortunately, single linkage is barely usable in most practical cases (it degenerates due to the well-known chaining effect), so IMHO the authors overshoot a bit here, favoring pretty theory over the practical usefulness of the result.
`kmeans` in R is pretty good Fortran code. I haven't looked at the package you got `kmeanspp` from, but I wouldn't expect it to be of the same quality. If it uses R code to do the initialization, that can hurt badly.
So in the end, you haven't been benchmarking k-means vs. k-means++. What you have shown is that the quality of R packages varies a lot, and that interpreted R code is substantially slower than compiled Fortran/C/C++ code, so you cannot rely on such benchmarks.
The low performance of `flexclust` is probably because it uses even more R code?
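To illustrate the gap, here is a minimal sketch of my own (not the code any of these packages actually use), comparing the compiled `kmeans` in base R against a single nearest-center assignment pass written as a plain interpreted loop:

```
set.seed(1)
X <- matrix(rnorm(2e5), ncol = 2)           # 100,000 points in 2D
C <- X[sample(nrow(X), 10), , drop = FALSE] # 10 starting centers

# Compiled implementation shipped with base R:
system.time(kmeans(X, centers = C, iter.max = 10, algorithm = "Lloyd"))

# Just ONE assignment pass, written as an interpreted R loop:
assign_loop <- function(X, C) {
  a <- integer(nrow(X))
  for (i in seq_len(nrow(X))) {
    d <- colSums((t(C) - X[i, ])^2)  # squared distances to all centers
    a[i] <- which.min(d)
  }
  a
}
system.time(assign_loop(X, C))  # typically slower than the entire kmeans() run
```

If the initialization in the `kmeanspp` package is written as interpreted loops like this, the slowdown you measured is unsurprising.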
Best Answer
You can see k-means as a special version of the EM algorithm, which may help a little.
Say you are estimating a multivariate normal distribution for each cluster, with the covariance matrix fixed to the identity matrix for all clusters, but a variable mean $\mu_i$, where $i$ is the cluster's index. Clearly, if the parameters $\{\mu_i\}$ are known, you can assign each point $p$ to its maximum-likelihood cluster (i.e. the $\mu_i$ for which the distance to $p$ is minimal). The EM algorithm for this problem is almost equivalent to k-means.
The other way around, if you know which points belong to which cluster, you can estimate the optimal $\mu_i$ in closed form (it is simply the cluster mean). However, finding the global maximum-likelihood models $\{\hat\mu_i\}$ without knowing the assignments requires you to consider every possible assignment of points to clusters. Since even with just thirty points and two clusters there are about a billion ($2^{30}$) such assignments, this is infeasible to calculate.
Instead, we can take a guess at the hidden parameters (or the model parameters) and iterate the two steps, with the possibility of ending up in a local maximum. If you allow each cluster to take partial responsibility for a point, you end up with EM; if you just assign each point to its optimal cluster, you get k-means.
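To make the correspondence concrete, here is a minimal R sketch of one iteration of each procedure, under the assumptions above (identity covariances, equal mixing weights); `em_step` and `kmeans_step` are my own illustrative names, not library functions:

```
# One EM iteration for spherical Gaussians N(mu_i, I), equal mixing weights.
em_step <- function(X, mu) {
  # E-step: squared distances from every point to every mean
  D <- outer(rowSums(X^2), rowSums(mu^2), "+") - 2 * X %*% t(mu)
  R <- exp(-D / 2)     # unnormalized responsibilities (assumes well-scaled data)
  R <- R / rowSums(R)  # each point shares responsibility across clusters
  # M-step: responsibility-weighted means
  t(R) %*% X / colSums(R)
}

# One k-means (Lloyd) iteration: same structure, but hard assignments.
kmeans_step <- function(X, mu) {
  D <- outer(rowSums(X^2), rowSums(mu^2), "+") - 2 * X %*% t(mu)
  a <- max.col(-D)     # each point goes entirely to its nearest mean
  # (a real implementation must also handle empty clusters)
  t(sapply(seq_len(nrow(mu)),
           function(i) colMeans(X[a == i, , drop = FALSE])))
}
```

The only difference is the soft responsibility matrix `R` versus the hard assignment vector `a`; hard assignment is what you get in the limit of the responsibilities as the fixed variance goes to zero.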
So, executive summary: in probabilistic terms, there is a globally optimal solution, but finding it requires iterating over all possible clusterings. Clearly the same holds if you have an objective function instead: you could enumerate all solutions and take the best one, but the number of candidate solutions is exponential in the size of your data.
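For a tiny data set this exhaustive search is actually doable; here is a sketch (enumeration scheme and names are my own) for two clusters under the usual within-cluster sum-of-squares objective:

```
set.seed(7)
X <- matrix(rnorm(16), ncol = 2)   # only 8 points: 2^8 = 256 assignments
n <- nrow(X)

best <- Inf
for (code in 0:(2^n - 1)) {
  a <- as.integer(intToBits(code))[1:n] + 1L  # assignment vector in {1, 2}
  if (length(unique(a)) < 2) next             # skip empty clusters
  wss <- sum(sapply(1:2, function(k) {
    Xk <- X[a == k, , drop = FALSE]
    sum(scale(Xk, scale = FALSE)^2)           # within-cluster sum of squares
  }))
  if (wss < best) { best <- wss; best_a <- a }
}
best_a  # the provably optimal 2-clustering -- but the cost grows as 2^n
```

At 8 points this finishes instantly; at the 30 points mentioned above you are already at a billion iterations, which is exactly why we settle for local optima via Lloyd iterations or EM.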