Solved – cluster analysis, Ward: how to evaluate number of clusters and their quality

clustering

I have a table of cosine similarities and I clustered it with Ward's method. The results look great, with a wonderful dendrogram, but when I tried to evaluate the quality of this cluster solution I got stuck.

First: identifying the number of clusters in my data (with Ward, unlike k-means, you don't have to set a precise number of clusters in advance). I calculated the sum of squares (see attachment) to see how many clusters there are, but there isn't a proper "elbow" in the plot, so how many clusters should I consider?
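To make the elbow check above concrete, here is a minimal Python sketch using SciPy. It makes several assumptions that are not in the question: the cosine table is converted to distances with sqrt(2 * (1 - cos)), which is the Euclidean distance between unit-length vectors (Ward expects Euclidean distances); the "largest jump between successive merge heights" rule stands in for eyeballing the elbow and is only one heuristic among several; the helper names `cosine_to_distance` and `suggest_k_by_largest_gap` and the toy data are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cosine_to_distance(sim):
    """Turn a cosine-similarity matrix into Euclidean distances between unit vectors."""
    sim = np.clip(np.asarray(sim, dtype=float), -1.0, 1.0)
    dist = np.sqrt(2.0 * (1.0 - sim))
    dist = 0.5 * (dist + dist.T)  # remove tiny floating-point asymmetries
    np.fill_diagonal(dist, 0.0)
    return dist


def suggest_k_by_largest_gap(sim):
    """Run Ward linkage and cut where the merge height jumps the most."""
    dist = cosine_to_distance(sim)
    Z = linkage(squareform(dist, checks=False), method="ward")
    heights = Z[:, 2]              # merge heights, one per merge
    gaps = np.diff(heights)        # jump from each merge to the next one
    i = int(np.argmax(gaps))       # last merge performed before the biggest jump
    k = len(dist) - (i + 1)        # clusters left after merges 0..i
    labels = fcluster(Z, t=k, criterion="maxclust")
    return k, heights, labels


# Toy data (hypothetical): three tight groups of unit vectors.
X = np.array([
    [1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [1.0, 0.1, 0.1],
    [0.1, 1.0, 0.0], [0.0, 1.0, 0.1], [0.1, 1.0, 0.1],
    [0.1, 0.0, 1.0], [0.0, 0.1, 1.0], [0.1, 0.1, 1.0],
])
X /= np.linalg.norm(X, axis=1, keepdims=True)
sim = X @ X.T

k, heights, labels = suggest_k_by_largest_gap(sim)
print("suggested number of clusters:", k)
print("merge heights:", np.round(heights, 3))
```

On real data the jumps are rarely this clean; when no single jump dominates, this sketch has the same problem as a missing elbow, so the suggested k is at best a starting point.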

Second: calculating the purity of the clustering (with the tool CluTo). When I specify 4, 5, 6, 7… clusters, the purity increases with the number of clusters I specify. Of course it does: if the number of clusters equals the number of instances in my data, then purity is 1 (the maximum). Duh.
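The effect is easy to reproduce with a small sketch, assuming every item has a known reference class (purity cannot be computed without one). The `purity` function below is a plain illustration of the usual definition, not CluTo's implementation, and the labels are made up.

```python
from collections import Counter


def purity(true_labels, cluster_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    members = {}
    for cls, clu in zip(true_labels, cluster_labels):
        members.setdefault(clu, Counter())[cls] += 1
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(true_labels)


# Hypothetical reference classes for six items.
true_labels = ["a", "a", "a", "b", "b", "b"]

one_cluster = [0, 0, 0, 0, 0, 0]    # everything lumped together
two_clusters = [0, 0, 0, 0, 1, 1]   # one "b" item ends up with the "a" items
singletons = [0, 1, 2, 3, 4, 5]     # every item in its own cluster

for name, clustering in [
    ("1 cluster", one_cluster),
    ("2 clusters", two_clusters),
    ("6 clusters", singletons),
]:
    print(name, round(purity(true_labels, clustering), 3))
```

Purity goes from 0.5 to about 0.833 to 1.0 here even though the singleton solution says nothing useful about the data, which is why purity on its own cannot be used to choose the number of clusters.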

Any suggestions on how to report this (the number of clusters and the quality of the clustering solution)?


Best Answer

If you are using R, this is a really nice page (http://www.statmethods.net/advstats/cluster.html) that steps through a few different methods to help identify the optimal number of clusters. HTH.