The Ward clustering algorithm is a hierarchical clustering method that minimizes an 'inertia' criterion at each step. This inertia quantifies the sum of squared residuals between the reduced signal and the initial signal: it is a measure of the variance of the error in an l2 (Euclidean) sense. Actually, you even mention it in your question. This is why, I believe, it makes no sense to apply it to a distance matrix that is not an l2 Euclidean distance.
On the other hand, an average linkage or a single linkage hierarchical clustering would be perfectly suitable for other distances.
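For example, with SciPy (a minimal sketch; the random data and variable names here are purely illustrative) you can hand a precomputed non-Euclidean distance matrix to average or single linkage, while Ward should be given Euclidean distances, or the raw observations, so that its inertia criterion is meaningful:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.random.rand(30, 4)                   # illustrative data: 30 points in 4-d
d_manhattan = pdist(X, metric="cityblock")  # condensed Manhattan distance matrix

# Fine: average (or single) linkage only needs pairwise distances, any metric.
Z_avg = linkage(d_manhattan, method="average")

# Ward assumes Euclidean geometry, so feed it Euclidean distances instead.
Z_ward = linkage(pdist(X, metric="euclidean"), method="ward")
```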
A somewhat complicated answer to your first question about distances goes like this.
Rencher's Multivariate Analysis cites Lance and Williams (1967), who proposed a general (flexible beta) method underlying most existing hierarchical cluster analyses: for three objects A, B and C, whether clusters or individual points treated as clusters of size 1, if clusters A and B are joined into AB, then the new distance from AB to C can be expressed as
$$
d(AB,C) = \alpha_A d(C,A) + \alpha_B d(C,B) + \beta\, d(A,B) + \gamma\, | d(C,A) - d(C,B) |.
$$
They suggested $\alpha_A + \alpha_B + \beta=1$, $\alpha_A = \alpha_B$, $\gamma=0$ and $\beta < 1$, although you could try other choices. E.g., single linkage is $\alpha_A = \alpha_B = 1/2$, $\beta=0$, $\gamma=-1/2$ (contracting the space), complete linkage is the same except for $\gamma=1/2$ (diluting the space), and Ward's method is $\alpha_A = (n_A + n_C)/(n_A + n_B + n_C)$, $\alpha_B = (n_B + n_C)/(n_A + n_B + n_C)$, $\beta= - n_C/(n_A + n_B + n_C)$, and $\gamma=0$ (somewhat space-contracting). The book discusses the properties of the algorithms, such as monotonicity and space contraction/dilution as functions of these parameters.
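The recurrence is simple enough to implement directly from a distance matrix. Here is a naive $O(n^3)$ sketch of my own (not code from the book; all function and variable names are made up) that plugs these parameter choices into the update formula:

```python
import numpy as np

def lw_params(method, n_a, n_b, n_c):
    """Lance-Williams coefficients (alpha_A, alpha_B, beta, gamma)."""
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "average":
        return n_a / (n_a + n_b), n_b / (n_a + n_b), 0.0, 0.0
    if method == "ward":
        n = n_a + n_b + n_c
        return (n_a + n_c) / n, (n_b + n_c) / n, -n_c / n, 0.0
    raise ValueError(f"unknown method: {method}")

def agglomerate(D, method="single"):
    """Naive O(n^3) agglomerative clustering from a square distance matrix."""
    D = np.asarray(D, dtype=float).copy()
    sizes = [1] * D.shape[0]
    active = list(range(D.shape[0]))
    merges = []
    while len(active) > 1:
        # Find the closest pair (A, B) among the active clusters.
        a, b = min(
            ((i, j) for k, i in enumerate(active) for j in active[k + 1:]),
            key=lambda ij: D[ij[0], ij[1]],
        )
        merges.append((a, b, D[a, b]))
        # Lance-Williams update: row/column a now represents the merged AB.
        for c in active:
            if c in (a, b):
                continue
            aa, ab, beta, gamma = lw_params(method, sizes[a], sizes[b], sizes[c])
            D[a, c] = D[c, a] = (
                aa * D[c, a] + ab * D[c, b]
                + beta * D[a, b] + gamma * abs(D[c, a] - D[c, b])
            )
        sizes[a] += sizes[b]
        active.remove(b)
    return merges  # list of (cluster A, cluster B, merge distance)
```

(If you use the Ward parameters, bear in mind that the recurrence is conventionally applied to squared Euclidean distances, consistent with the first answer above.)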
So an algorithm can be built from the distances alone and does not need the data matrix. However, for typical problems (sample size $n$ > dimension of the data $p$) the distance matrix takes more memory than the data matrix, so this may not be a very efficient way of approaching the problem, unless of course the distance matrix is all you have.
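To make the memory comparison concrete (sizes here are hypothetical), compare a data matrix's footprint with that of the condensed distance matrix SciPy's pdist returns:

```python
import numpy as np
from scipy.spatial.distance import pdist

n, p = 5000, 20          # hypothetical sample size and dimension
X = np.random.rand(n, p)
D = pdist(X)             # condensed distances: n*(n-1)/2 entries

print(X.nbytes / 1e6)    # ~0.8 MB for the n-by-p data matrix
print(D.nbytes / 1e6)    # ~100 MB for the pairwise distances
```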
Best Answer
No, there's nothing inherently incorrect about doing that. In fact, for binary data the Manhattan distance and the Hamming distance are equivalent. For each variable the distance contribution is either 0 or 1, and these contributions are summed over all variables. The Hamming distance explicitly sets each contribution to 0 or 1 as match/mismatch. The Manhattan distance works out to be the same because the underlying data are binary, so the only possible Manhattan distances between two values are 0 and 1.
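A quick numerical check of that equivalence (the toy vectors are my own; note that SciPy's hamming returns a proportion, not a count):

```python
import numpy as np
from scipy.spatial.distance import cityblock, hamming

a = np.array([1, 0, 1, 1, 0])
b = np.array([0, 0, 1, 0, 1])

print(cityblock(a, b))         # 3   -> sum of |a_i - b_i|
print(hamming(a, b) * len(a))  # 3.0 -> mismatch count (proportion * length)
```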