Algorithms like Lloyd's can be implemented using only $k\cdot(2\cdot d + 1)$ floating-point values of memory. MacQueen's k-means algorithm needs only $k\cdot(d + 1)$.
However, since most users will want to know which point belongs to which cluster, almost every implementation you'll find uses $O(n + k\cdot d)$ memory.
In other words, the memory use of k-means is essentially the size of the output data.
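To illustrate the $k\cdot(2\cdot d + 1)$ figure, here is a minimal sketch of Lloyd's algorithm that streams over the points and keeps only the centers, the per-cluster sums, and the counts (the function name and the naive initialization are ours, for illustration only):

```python
import numpy as np

def lloyd_low_memory(points, k, iters=10):
    """Sketch of Lloyd's k-means with k*(2*d + 1) accumulator values:
    k*d for the centers, k*d for the running sums, and k for the counts.
    Point assignments are recomputed on the fly and never stored."""
    d = points.shape[1]
    centers = points[:k].astype(float)  # naive deterministic init
    for _ in range(iters):
        sums = np.zeros((k, d))   # k*d running sums
        counts = np.zeros(k)      # k counts
        for x in points:          # stream: one point at a time
            j = np.argmin(((centers - x) ** 2).sum(axis=1))
            sums[j] += x
            counts[j] += 1
        nonempty = counts > 0
        centers[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return centers
```

Storing the labels for all $n$ points is exactly what pushes real implementations to $O(n + k\cdot d)$.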
Only single-linkage is optimal. Complete-linkage fails, so your intuition was wrong there. ;-)
As a simple example, consider the one-dimensional data set 1,2,3,4,5,6.
The (unique) optimum solution for complete linkage with two clusters is (1,2,3) and (4,5,6) (complete linkage height 2). The (unique) optimum solution with three clusters is (1,2), (3,4), (5,6), at linkage height 1.
There exists no agglomerative hierarchy that contains both, obviously, because these partitions do not nest. Hence, no hierarchical clustering contains all optimal solutions for complete linkage. Based on that example, I would even expect complete linkage to exhibit among the worst gaps between the optimum solution and the agglomerative solution at high levels... And this example is likely a counterexample for most other schemes, too (except single linkage, where this data set degenerates).
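The mismatch can be checked with SciPy's agglomerative clustering (a sketch; variable names are ours, and tie-breaking among equal-distance merges could vary, but the cluster sizes come out the same either way):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
Z = linkage(X, method='complete')

# The 3-cluster cut recovers the optimum (1,2), (3,4), (5,6) at height 1:
three = fcluster(Z, t=3, criterion='maxclust')

# The 2-cluster cut happens at linkage height 3, not the optimal height 2,
# and splits the data into clusters of sizes 2 and 4 instead of 3 and 3:
two = fcluster(Z, t=2, criterion='maxclust')
```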
You may be interested in studying this article:
Gunnar E. Carlsson, Facundo Mémoli:
Characterization, Stability and Convergence of Hierarchical Clustering Methods. Journal of Machine Learning Research 11: 1425-1470 (2010)
and the references contained therein, in particular also:
Jon M. Kleinberg: An Impossibility Theorem for Clustering. Advances in Neural Information Processing Systems. 2003.
These seem to indicate that, given some mild stability requirements (which can be interpreted as uncertainty in our distance measurements), only single-linkage clustering remains. Which, unfortunately, is all but unusable in most cases... so IMHO for practical purposes the authors overshoot a bit, favoring pretty theory over the usefulness of the result.
Best Answer
Why not first look at other clustering algorithms? For example, DBSCAN does not need to know the number of clusters, and it can work with any distance function (assuming you want something like cosine distance for your sparse vectors).
Given the characteristics of your data, k-means is just bound to fail. The problem is that the means will most likely no longer be sparse, so they are actually outliers, and by no means central objects. Don't use k-means with sparse vectors or non-Euclidean distances! (k-means may not even converge with other distance functions, because the mean might no longer minimize the objective function.)
Seriously, there are at least 100 more modern clustering algorithms. Try these first.
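To make the DBSCAN suggestion concrete, here is a minimal sketch with scikit-learn, using cosine distance on sparse vectors (the toy data and the eps value are made up for illustration):

```python
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Toy sparse "document" vectors pointing in two directions of feature space.
X = csr_matrix([
    [1.0, 0.0, 0.0],
    [2.0, 0.1, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 2.0],
])

# eps is a cosine-distance radius; note no number of clusters is given.
labels = DBSCAN(eps=0.2, min_samples=2, metric='cosine').fit_predict(X)
```

Points with no dense neighborhood get the label -1 (noise), which is often exactly what you want for messy sparse data.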