Solved – How to evaluate “external” quality of clustering

clustering, data mining

Let's say you want to cluster some objects, say documents, or sentences, or images.

On the technical side, you first represent these objects somehow so that you can calculate distances between them, and then you feed those representations to some clustering algorithm.

Externally, however, you just want to group similar objects together, where "similar" is meant in some sense (and that's where things become pretty vague for me). For example, in the case of sentences we want each cluster to contain sentences about a similar topic or concept; we feel that the sentences "oh look at this pic of a cute lolcat" and "facebook revealed new shiny feature tonight" should be in different clusters.

What are the usual approaches for measuring this "external" quality of a clustering? That is, we want to measure how well our clustering procedure groups the original objects (sentences, images). We are not interested in internal measures (such as average cluster radius or cluster sparseness), since those deal with the objects' representations, not with the real objects. The chosen representation may be awful, and even if the internal measures are great, we may end up with clusters that are complete junk from our vague, subjective, "some sense"-ish point of view.

P.S. Having limited knowledge of the clustering domain, I suspect I may be asking about something really obvious, or my terminology may sound strange to clustering experts. If so, please advise what I should read on the subject.

P.P.S. Just in case, I asked the very same question on Quora: http://www.quora.com/How-to-evaluate-external-quality-of-clustering

Best Answer

I do not fully understand what you mean by internal and external quality. I assume that internal refers to a measure computed on the obtained partition, while external refers to the result that you would like to obtain.

Usually, an internal measure compares the distances within clusters to the distances between clusters. Intuitively, if clusters are dense and well separated, then you have a good clustering. As this is the objective of clustering, you cannot really do better, unless you ask people to look at your partitions and say whether or not they are good.
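As a rough illustration of such an internal measure, here is a minimal Python sketch that divides the mean within-cluster distance by the mean between-cluster distance. The function name within_between_ratio, the toy points, and the choice of a plain ratio are assumptions made for illustration, not a standard index; lower values suggest denser, better separated clusters.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def within_between_ratio(points, labels, dist=euclidean):
    """Mean within-cluster distance divided by mean between-cluster distance.

    Hypothetical helper for illustration: lower is better.
    """
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        # Pairs sharing a label count as "within", all others as "between".
        (within if labels[i] == labels[j] else between).append(d)
    if not within or not between:
        raise ValueError("need both within-cluster and between-cluster pairs")
    return (sum(within) / len(within)) / (sum(between) / len(between))


# Toy data: two tight pairs of points, far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good_ratio = within_between_ratio(points, [0, 0, 1, 1])  # groups the tight pairs
bad_ratio = within_between_ratio(points, [0, 1, 0, 1])   # splits the tight pairs

print(good_ratio, bad_ratio)
```

On this toy data the partition that keeps the tight pairs together gets a ratio below 1, and the one that splits them gets a ratio above 1. Note that the score depends entirely on the representation and the distance passed in as dist, which is exactly the limitation raised in the question.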

If the resulting clustering does not seem good to you, it is probably because either your points are not correctly placed or your distance is not adapted to your problem. For example, suppose that your expected clusters form long parallel rectangles in your representation. If you use a Euclidean distance, you will probably not find the expected partition, because two points at opposite ends of the same rectangle are farther apart than two neighbouring points in different rectangles.
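The rectangle example can be made concrete with a small Python sketch. The coordinates, the function scaled_distance, and its weights are all hypothetical choices for illustration: the idea is only that shrinking differences along the long axis makes points of the same rectangle close again.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def scaled_distance(a, b, weights=(0.001, 1.0)):
    """Weighted Euclidean distance.

    The weights are an assumption for this toy case: they shrink differences
    along x (the long axis of the rectangles) and keep differences along y.
    """
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


# Two long, thin, parallel rectangles: A lies along y = 0, B along y = 1.
a_left, a_right = (0.0, 0.0), (10.0, 0.0)  # both in rectangle A
b_left = (0.0, 1.0)                        # in rectangle B

# Euclidean distance: the other rectangle is closer than the far end of A.
same_rect_euclid = euclidean(a_left, a_right)
other_rect_euclid = euclidean(a_left, b_left)

# Rescaled distance: the two ends of A are now closer than the other rectangle.
same_rect_scaled = scaled_distance(a_left, a_right)
other_rect_scaled = scaled_distance(a_left, b_left)

print(same_rect_euclid, other_rect_euclid)
print(same_rect_scaled, other_rect_scaled)
```

With the plain Euclidean distance the point in the other rectangle is the nearer one, so a distance-based grouping is pulled toward the wrong partition; with the rescaled distance the ordering flips.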

To solve this problem, if you find points in the same cluster that should not belong together, ask yourself why the chosen distance considered them close. Then build (or read about) a distance function that avoids this problem.

To sum up, if you find that the computed partition does not make sense, it is not necessarily because your quality measure is wrong, but more likely because the clustering performed the wrong task. Finding a good distance and space representation is probably the main task when doing clustering.