The Ward clustering algorithm is a hierarchical clustering method that minimizes an 'inertia' criterion at each step. This inertia quantifies the sum of squared residuals between the reduced signal and the initial signal: it is a measure of the variance of the error in an l2 (Euclidean) sense. Actually, you even mention it in your question. This is why, I believe, it makes no sense to apply it to a distance matrix that is not an l2 Euclidean distance.
On the other hand, average-linkage or single-linkage hierarchical clustering is perfectly suitable for other distances.
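To make the contrast concrete, here is a minimal sketch with SciPy (the two-blob data is invented for illustration): Ward operates on the raw coordinates themselves, while average linkage happily consumes any precomputed dissimilarity.

```python
# Minimal sketch (SciPy): Ward needs raw Euclidean coordinates, while
# average linkage accepts an arbitrary precomputed distance matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),    # blob around (0, 0)
               rng.normal(5, 1, (20, 2))])   # blob around (5, 5)

# Ward: operates on the observations (implicit Euclidean geometry).
Z_ward = linkage(X, method='ward')

# Average linkage: any dissimilarity works, e.g. cosine distance.
D = pdist(X, metric='cosine')                # condensed distance matrix
Z_avg = linkage(D, method='average')

print(fcluster(Z_ward, t=2, criterion='maxclust'))
print(fcluster(Z_avg, t=2, criterion='maxclust'))
```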
See, even hierarchical clustering needs parameters if you want to get a partitioning out. In fact, hierarchical clustering has (roughly) four parameters:

1. the actual algorithm (divisive vs. agglomerative),
2. the distance function,
3. the linkage criterion (single-link, Ward, etc.), and
4. the distance threshold at which you cut the tree (or any other extraction method).
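Those four knobs map almost one-to-one onto, for example, SciPy's agglomerative API (a sketch; the data and threshold are made up):

```python
# Sketch: the four hierarchical-clustering parameters in SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))

# 1. algorithm: scipy.cluster.hierarchy is agglomerative (bottom-up)
# 2. distance function: metric='cityblock' (Manhattan)
# 3. linkage criterion: method='single'
Z = linkage(X, method='single', metric='cityblock')

# 4. extraction: cut the dendrogram at a distance threshold
labels = fcluster(Z, t=0.5, criterion='distance')
```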
The fact is that there is no good "push button" solution to cluster analysis. It is an exploratory technique, meaning that you have to try different methods and parameters and analyze the results.
I found DBSCAN to be very usable in most cases. Yes, it has two parameters (the distance threshold, a.k.a. the neighbor predicate, and minPts, a.k.a. the core predicate) - I'm not counting the distance function separately this time, because what is really needed is a binary "is neighbor of" predicate; see GDBSCAN.
The reason is that in many applications you can choose these values intuitively if you have understood your data well enough. E.g., when working with geo data, distance is literally in kilometers, which allows me to intuitively specify the spatial resolution.
Similarly, minPts gives me intuitive control over how "significant" a subset of observations needs to be before it becomes a cluster.
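For instance, with scikit-learn (a sketch; the coordinates and thresholds are invented), a 1 km spatial resolution translates directly into eps:

```python
# Sketch: DBSCAN on geo coordinates, where eps has a direct physical meaning.
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0

# Some made-up (latitude, longitude) points, in degrees.
coords_deg = np.array([
    [52.5200, 13.4050],   # Berlin
    [52.5205, 13.4060],
    [52.5190, 13.4040],
    [48.8566, 2.3522],    # Paris: far away, ends up as noise here
])

# Haversine distances in scikit-learn expect radians.
coords_rad = np.radians(coords_deg)

# eps = 1 km spatial resolution; minPts = 3 points to form a cluster.
db = DBSCAN(eps=1.0 / EARTH_RADIUS_KM, min_samples=3,
            metric='haversine', algorithm='ball_tree')
labels = db.fit_predict(coords_rad)
print(labels)   # -1 marks noise
```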
Usually, when you find DBSCAN hard to use, it is because you have not yet understood what "distance" means for your data. You then first need to figure out how to measure distance and what the resulting numbers mean to you. Then you'll know the threshold to use.
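One common aid at that point (it goes back to the original DBSCAN paper) is the k-distance plot: sort each point's distance to its k-th nearest neighbor and look for a "knee". A sketch:

```python
# Sketch: the k-distance plot heuristic for picking the DBSCAN eps threshold.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))       # stand-in for your real data

k = 4                               # a common choice is k = minPts
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each query point
dists, _ = nn.kneighbors(X)                      # is its own 0-th neighbor
k_dist = np.sort(dists[:, -1])
# Plot k_dist (e.g. with matplotlib) and set eps near the knee of the curve.
```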
And in the end, go and try stuff out. It's data exploration, not `return(truth);`. There is no "true" clustering. There are only "obvious", "useless" and "interesting" clusterings, and these qualities cannot be measured mathematically; they are subjective to the user.
For internal cluster validation, the Calinski-Harabasz variance ratio criterion (VRC) is fairly standard.
But there are many, many more, such as the C index, DBCV, etc. I believe some of these indexes even have a dozen variants.
The Dunn index is essentially the ratio separation/compactness, while Davies-Bouldin is a ratio compactness/separation. So I guess you are suggesting just one of the many variants of these two.
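Both flavors are readily available in scikit-learn (a sketch on made-up two-blob data):

```python
# Sketch: two standard internal validation indexes in scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(calinski_harabasz_score(X, labels))  # higher = better (separation/compactness)
print(davies_bouldin_score(X, labels))     # lower = better (compactness/separation)
```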
Note that if you have many clusters, it is better to only consider the nearby neighbor clusters, and not the average distance to all others! Suppose you have one very badly split cluster that is nevertheless extremely well separated from the majority of the data: the naive within/in-between quotient will fail, because averaging over all the distant clusters drowns out the bad split. That is why you usually define separation based on the nearest other cluster(s) only, instead of the entire data set.
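A tiny numeric sketch of that failure mode (the centroids are invented; one pair is deliberately placed almost on top of each other):

```python
# Sketch: nearest-cluster separation vs. average separation over all clusters.
import numpy as np

centroids = np.array([[0.0, 0.0],
                      [0.4, 0.0],     # badly split: nearly duplicates the first
                      [50.0, 50.0],
                      [-50.0, 50.0]])

# Pairwise centroid distances; mask out the diagonal.
dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
np.fill_diagonal(dist, np.nan)

print(np.nanmin(dist, axis=1))    # nearest-cluster separation exposes the 0.4
print(np.nanmean(dist, axis=1))   # averaging hides it among the huge distances
```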
It just shows once more that you cannot rely on Wikipedia alone (and too many people, and even books, simply copy from Wikipedia...)
But beware that all these are just heuristics. You can find counterexamples for each, I suppose.