I do not perfectly understand what you mean by internal and external quality. I assume that "internal" refers to a measure computed on the obtained partition itself, while "external" compares it to the result you would like to obtain.
Usually, an internal measure compares the within-cluster distances to the distances between clusters. Intuitively, if clusters are dense and well separated, then you have a good clustering. Since that is the objective of clustering, you cannot really do better, unless you ask people to look at your partitions and judge whether or not they are good.
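As a minimal sketch of such an internal measure (not any one named index, just the within-vs-between idea stated above), you can divide the average within-cluster distance by the average between-cluster distance; the lower the ratio, the denser and better separated the clusters:

```python
from itertools import combinations
from math import dist

def internal_quality(clusters):
    """Mean within-cluster distance divided by mean between-cluster
    distance. Lower is better. Assumes every cluster has >= 2 points."""
    within = [dist(a, b)
              for pts in clusters
              for a, b in combinations(pts, 2)]
    between = [dist(a, b)
               for c1, c2 in combinations(clusters, 2)
               for a in c1 for b in c2]
    return (sum(within) / len(within)) / (sum(between) / len(between))

# Two dense, well-separated clusters score low ...
tight = internal_quality([[(0, 0), (0, 1)], [(10, 10), (10, 11)]])
# ... while a partition that splits the natural groups scores high.
mixed = internal_quality([[(0, 0), (10, 10)], [(0, 1), (10, 11)]])
```

Here `tight` comes out well below 1 and `mixed` well above 1, which matches the intuition that the first partition is good and the second is not.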
If the resulting clustering does not seem good to you, it is probably because either your points are not correctly placed or your distance is not adapted to your problem. For example, suppose that your expected clusters form long parallel rectangles in your representation. If you use a Euclidean distance, you won't be able to find the expected partition.
To solve this problem, if the resulting partition puts points in the same cluster that should not belong together, ask yourself why the chosen distance considered them as close. Then, just build (or read about) a new distance function that avoids this problem.
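To make the rectangle example concrete, here is a hedged sketch (the weighted metric and its `x_weight` value are illustrative assumptions, not a recipe): with long horizontal clusters, plain Euclidean distance calls a point in the neighbouring band "closer" than a far-away point in the same band, while a distance that down-weights the elongated axis does not.

```python
from math import hypot

def euclidean(p, q):
    return hypot(p[0] - q[0], p[1] - q[1])

def stretched(p, q, x_weight=0.1):
    # Down-weight the x axis so points in the same long horizontal
    # band count as close. The weight is a hypothetical value you
    # would tune for your own data.
    return hypot(x_weight * (p[0] - q[0]), p[1] - q[1])

a, b = (0.0, 0.0), (10.0, 0.0)   # far apart in the SAME elongated cluster
c = (0.0, 1.5)                   # a point in the neighbouring band

wrong = euclidean(a, b) > euclidean(a, c)   # Euclidean groups a with c
fixed = stretched(a, b) < stretched(a, c)   # weighted metric keeps a with b
```

Both `wrong` and `fixed` are true here: the same three points get grouped differently purely because the distance function changed.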
To sum up, if you find that the computed partition does not make sense, it is not necessarily because your quality measure is wrong; more likely, the clustering performed the wrong task. Finding a good distance and space representation is probably the main task when doing clustering.
The problem, in particular with k-means applied to real-world, labeled data, is that clusters will usually not agree with your labels very well, unless you either generated the labels using a similar clustering algorithm (a self-fulfilling prophecy) or the data set is really simple.
Have you tried computing the k-means statistics, such as sums of squares, on the original labeling of the data set? I would not be at all surprised if they are substantially worse than after running k-means.
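To try this yourself, you only need the within-cluster sum of squares of an arbitrary labeling; a minimal sketch (plain Python, not any particular library's implementation):

```python
def wcss(points, labels):
    """Within-cluster sum of squared distances to each cluster mean,
    for any labeling -- original labels or a k-means assignment."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    total = 0.0
    for pts in groups.values():
        mean = [sum(coord) / len(pts) for coord in zip(*pts)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mean))
                     for p in pts)
    return total
```

For example, on four points forming two obvious pairs, the k-means-style labeling `[0, 0, 1, 1]` gives a far smaller WCSS than the crossed labeling `[0, 1, 0, 1]`, which is exactly the effect described above: original labels can score much worse on k-means' own criterion.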
I figure it's just another case of the algorithm not fitting your problem.
Evaluating clustering algorithms is really hard, because you actually want to find something you do not know yet. Even if the clustering reproduced the original labels, it would actually have failed, because it did not tell you anything new; you could just have used the labels instead.
Maybe the most realistic evaluation of a clustering algorithm is the following: if you incorporate the result of the clustering algorithm into a classification algorithm, does it improve the classification accuracy significantly? I.e. treat clustering as preprocessing/support functionality for an algorithm that you can evaluate reasonably.
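A toy sketch of this evaluation (the XOR-style blobs, the "recovered" centroids, and the majority-vote classifier are all illustrative assumptions): the class is not a monotone function of either raw coordinate, so a baseline classifier does no better than chance, but feeding the cluster id into the classifier makes the problem trivial.

```python
from collections import Counter
from math import dist

# Toy data: four blobs, class depends on which blob a point is in
# (XOR pattern), not on either raw coordinate alone.
blobs = {(0, 0): 0, (0, 5): 1, (5, 0): 1, (5, 5): 0}
points, labels = [], []
for centre, cls in blobs.items():
    for dx, dy in [(-0.2, 0), (0.2, 0), (0, -0.2), (0, 0.2)]:
        points.append((centre[0] + dx, centre[1] + dy))
        labels.append(cls)

# Pretend a clustering step recovered these centroids.
centroids = list(blobs)

def cluster_id(p):
    return min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))

# Baseline classifier: predict the global majority class.
majority = Counter(labels).most_common(1)[0][0]
base_acc = sum(lab == majority for lab in labels) / len(labels)

# Same classifier, but using the cluster id as a feature:
# a majority vote within each cluster.
per_cluster = {}
for p, lab in zip(points, labels):
    per_cluster.setdefault(cluster_id(p), []).append(lab)
vote = {k: Counter(v).most_common(1)[0][0] for k, v in per_cluster.items()}
clust_acc = sum(vote[cluster_id(p)] == lab
                for p, lab in zip(points, labels)) / len(labels)
```

Here `base_acc` is 0.5 (chance) and `clust_acc` is 1.0: the clustering result demonstrably improved the downstream classifier, which is the evaluation signal being proposed.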
Best Answer
The choice of metric rather depends on what you consider the purpose of clustering to be. Personally, I think clustering ought to be about identifying different groups of observations that were each generated by a different data-generating process. So I would test the quality of a clustering by generating data from known data-generating processes and then calculating how often patterns are misclassified by the clustering. Of course this involves making assumptions about the distribution of patterns from each generating process, but you can use datasets designed for supervised classification.
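A minimal sketch of that test (the two Gaussian generating processes and the midpoint-threshold "clustering" are illustrative stand-ins; any clustering algorithm could be slotted in): sample from known processes, cluster, and score the misclassification rate, minimised over label permutations since cluster ids are arbitrary.

```python
import random
from itertools import permutations

random.seed(0)

# Two known data-generating processes: 1-D Gaussians at 0 and 6.
truth, points = [], []
for cls, mu in enumerate([0.0, 6.0]):
    for _ in range(50):
        truth.append(cls)
        points.append(random.gauss(mu, 1.0))

# Stand-in clustering: split at the midpoint of the sample extremes.
cut = (min(points) + max(points)) / 2
assigned = [int(p > cut) for p in points]

def misclassification(truth, assigned, k=2):
    """Fraction misclassified, minimised over relabelings of the
    clusters (cluster ids carry no meaning on their own)."""
    best = len(truth)
    for perm in permutations(range(k)):
        best = min(best, sum(t != perm[a] for t, a in zip(truth, assigned)))
    return best / len(truth)
```

With generating processes this well separated, `misclassification(truth, assigned)` comes out at or near zero; as the processes are moved closer together, the rate rises, giving a controlled way to compare clustering methods.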
Others view clustering as attempting to group together points with similar attribute values, in which case measures such as SSE are applicable. However, I find this definition of clustering rather unsatisfactory, as it only tells you something about the particular sample of data, rather than something generalisable about the underlying distributions. How methods deal with overlapping clusters is a particular problem with this view (for the "data-generating process" view it causes no real problem; you just get probabilities of cluster membership).