K-Means Clustering – Quantitative Evaluation Metrics

I'm using k-means to cluster sentences according to the part-of-speech tags of the words in a sentence, and I have a nice, easy to understand visualization of the result, but I'm struggling to find a good method to quantify the result.

My starting point is a paper by Dowty which postulates that there is a certain fixed set of verb themes (e.g. causation, movement) which are supposedly different semantically and syntactically. To check this claim, I've done k-means clustering (k=8) on a large corpus of part-of-speech tagged sentences. Then, I took a small number (~50) of sentences from each of the resulting clusters, shuffled and hand-labelled them. With this, I made the following visualization of the label assignments per cluster:

cluster assignments chart

Now what I'm looking for is a way to compute the quality/usefulness of the clustering result given the distribution of labels. I'm looking for a value that should be high when most of any label ends up in few of the clusters, and 0 when the labels are equally distributed over the clusters. I have looked into Shannon entropy but I'm not sure if it is what I'm looking for conceptually, and not sure where else to look.

Any clues would be much appreciated!

K-Means Clustering – Quantitative Evaluation Metrics

Best Answer

Related Question

Best Answer

Related Solutions

Solved – Combine Clustering and classification

Solved – How to evaluate a clustering/unsupervised learning problem with massive amounts of data, with labels only for a small fraction of points

Related Question