K-Means Clustering – Quantitative Evaluation Metrics

clusteringk-means

I'm using k-means to cluster sentences according to the part-of-speech tags of the words in a sentence, and I have a nice, easy to understand visualization of the result, but I'm struggling to find a good method to quantify the result.

My starting point is a paper by Dowty which postulates that there is a certain fixed set of verb themes (e.g. causation, movement) which are supposedly different semantically and syntactically. To check this claim, I've done k-means clustering (k=8) on a large corpus of part-of-speech tagged sentences. Then, I took a small number (~50) of sentences from each of the resulting clusters, shuffled and hand-labelled them. With this, I made the following visualization of the label assignments per cluster:

cluster assignments chart

Now what I'm looking for is a way to compute the quality/usefulness of the clustering result given the distribution of labels. I'm looking for a value that should be high when most of any label ends up in few of the clusters, and 0 when the labels are equally distributed over the clusters. I have looked into Shannon entropy but I'm not sure if it is what I'm looking for conceptually, and not sure where else to look.

Any clues would be much appreciated!

Best Answer

Have you had a look at the cluster analysis article in Wikipedia?

There is a whole section on external cluster evaluation measures. This seems to be exactly what you are looking for.

https://en.wikipedia.org/wiki/Cluster_analysis#External_evaluation