The Ward clustering algorithm is a hierarchical clustering method that minimizes an 'inertia' criterion at each step. This inertia quantifies the sum of squared residuals between the reduced signal and the initial signal: it is a measure of the variance of the error in an l2 (Euclidean) sense. Actually, you even mention it in your question. This is why, I believe, it makes no sense to apply it to a distance matrix that is not an l2 Euclidean distance.
On the other hand, an average linkage or a single linkage hierarchical clustering would be perfectly suitable for other distances.
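As a minimal sketch of that difference (using SciPy on made-up data): Ward is fed the raw observations, while an average-linkage tree can be built directly from any precomputed pairwise distance.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # made-up observations

# Ward works on the raw observations: it minimizes within-cluster
# variance, an inherently (squared) Euclidean criterion.
Z_ward = linkage(X, method="ward")

# Other linkages only need pairwise distances, so any metric is fine.
D = pdist(X, metric="cityblock")      # condensed distance matrix, any metric
Z_avg = linkage(D, method="average")

labels = fcluster(Z_avg, t=3, criterion="maxclust")
```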
A somewhat complicated answer to your first question about distances goes like this.
Rencher's Multivariate Analysis cites Lance and Williams (1967), who proposed a general (flexible beta) method underlying most existing hierarchical cluster analyses: for three objects A, B and C, be they clusters or individual points treated as clusters of size 1, if clusters A and B are joined into AB, then the new distance from AB to C can be expressed as
$$
d(AB,C) = \alpha_A d(C,A) + \alpha_B d(C,B) + \beta d(A,B) + \gamma| d(C,A) - d(C,B) |.
$$
They suggested $\alpha_A + \alpha_B + \beta=1$, $\alpha_A = \alpha_B$, $\gamma=0$ and $\beta < 1$, although you could try other choices. E.g., single linkage is $\alpha_A = \alpha_B = 1/2$, $\beta=0$, $\gamma=-1/2$ (contracting the space), complete linkage is the same except for $\gamma=1/2$ (diluting the space), and Ward's method is $\alpha_A = (n_A + n_C)/(n_A + n_B + n_C)$, $\alpha_B = (n_B + n_C)/(n_A + n_B + n_C)$, $\beta= - n_C/(n_A + n_B + n_C)$, and $\gamma=0$ (somewhat space-contracting). The book discusses the properties of the algorithms, such as monotonicity and space contraction/dilution as functions of these parameters.
So an algorithm can be built just from the distances, and does not need the data matrix. However, the distance matrix for typical problems (sample size $n$ > dimension of the data $p$) takes more memory than the data matrix, so this may not be a very efficient way of approaching the problem, unless of course the data matrix is all you have.
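To make that concrete, here is a rough sketch (the `lance_williams_update` and `agglomerate` helpers are hypothetical names) of the Lance-Williams update together with a naive agglomerative loop that works from the distance matrix alone. Note that for Ward's coefficients the recurrence is usually stated for squared Euclidean distances, which ties back to the first answer above.

```python
import numpy as np

def lance_williams_update(d_ac, d_bc, d_ab, n_a, n_b, n_c, method="ward"):
    """New distance d(AB, C) after merging clusters A and B, using the
    Lance-Williams recurrence quoted above. Cluster sizes are only
    needed for Ward's coefficients."""
    if method == "single":        # alpha_A = alpha_B = 1/2, beta = 0, gamma = -1/2
        a_a, a_b, beta, gamma = 0.5, 0.5, 0.0, -0.5
    elif method == "complete":    # same, but gamma = +1/2
        a_a, a_b, beta, gamma = 0.5, 0.5, 0.0, 0.5
    elif method == "ward":
        n = n_a + n_b + n_c
        a_a, a_b = (n_a + n_c) / n, (n_b + n_c) / n
        beta, gamma = -n_c / n, 0.0
    else:
        raise ValueError(method)
    return a_a * d_ac + a_b * d_bc + beta * d_ab + gamma * abs(d_ac - d_bc)


def agglomerate(D, method="single"):
    """Naive O(n^3) agglomerative clustering from a symmetric (n x n)
    distance matrix only; returns the merge order as (i, j, distance)
    triples, reusing original indices for merged clusters."""
    D = D.astype(float).copy()
    active = list(range(len(D)))           # indices of live clusters
    sizes = {i: 1 for i in active}
    merges = []
    while len(active) > 1:
        # find the closest pair among the active clusters
        i, j = min(((i, j) for i in active for j in active if i < j),
                   key=lambda ij: D[ij])
        merges.append((i, j, D[i, j]))
        # update distances from the merged cluster (kept at index i)
        for c in active:
            if c not in (i, j):
                D[i, c] = D[c, i] = lance_williams_update(
                    D[i, c], D[j, c], D[i, j],
                    sizes[i], sizes[j], sizes[c], method)
        sizes[i] += sizes[j]
        active.remove(j)
    return merges
```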
Best Answer
Be careful when mixing arbitrary distance functions with k-means.
K-means does not, strictly speaking, use Euclidean distance. That is a common misconception. K-means assigns each point so that its contribution to the within-cluster variance is minimized, i.e. it minimizes $\sum_d (x_d - \mu_d)^2$ over the dimensions $d$. But if you sum up all these contributions, you get the squared Euclidean distance, and since $\sqrt{}$ is monotone, you can just as well assign each point to the closest mean by Euclidean distance (not computing the square roots is faster, though).
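A minimal sketch of the assignment step (plain NumPy, hypothetical helper name) shows that the square root never needs to be computed:

```python
import numpy as np

def assign_to_centers(X, centers):
    """K-means assignment step: each point goes to the center with the
    smallest squared Euclidean distance; the square root is skipped
    because it does not change the argmin."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)
```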
The bigger issue when mixing k-means with other distance functions is actually the mean. The way k-means updates the mean is tied to variance: the mean is the best estimate for minimizing the total squared error. But that does not imply it will be the best estimate for minimizing an arbitrary other distance function! (See e.g. this counter-example, where the mean is suboptimal for EMD, and this counter-example for absolute Pearson correlation.)
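The linked counter-examples are not reproduced here, but a toy 1-D illustration (made-up numbers) makes the same point: the mean is the optimal center only for the squared-error criterion, while for the L1 (Manhattan) criterion the median does better.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 100.0])     # made-up 1-D "cluster"
mean, median = x.mean(), np.median(x)

def sq_err(c):  return ((x - c) ** 2).sum()   # criterion the mean optimizes
def abs_err(c): return np.abs(x - c).sum()    # L1 criterion, optimized by the median

print(sq_err(mean) < sq_err(median))    # True: the mean wins on squared error
print(abs_err(mean) > abs_err(median))  # True: the median wins on absolute error
```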
Usually, in the situations where you would want a distance function other than Euclidean - for example because of high dimensionality or discrete data - you will not want to use k-means for the very same reasons: the mean does not make much sense for sparse vectors or for binary vectors (it will not be binary).
For other distance functions, have a look at k-medoids.
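For instance, a very small k-medoids sketch (Voronoi-iteration style rather than full PAM; `k_medoids` is a hypothetical name) only ever needs a precomputed dissimilarity matrix and never computes a mean:

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Tiny k-medoids sketch on an (n x n) matrix D of arbitrary
    pairwise dissimilarities. Centers are always actual data points
    (medoids), so no averaging of feature vectors is required."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)          # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # new medoid = member minimizing total distance within the cluster
                costs = D[np.ix_(members, members)].sum(axis=0)
                new_medoids[c] = members[costs.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels
```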