Solved – Verifying the output of implementing internal clustering validity indexes

clustering, java, metric

I have implemented some internal clustering validity indexes in Java:

  1. Simplified Silhouette.
  2. Calinski-Harabasz (VRC).
  3. Davies-Bouldin.
  4. Dunn's Index.

How could I verify if my implementation is correct?

I have tested the indexes on Iris, Wine, Ionosphere, Heart, Sonar, Zoo and Glass benchmarks.

I used the K-Means algorithm with the number of clusters ranging from 2 to 8.

The problem is:
I obtain the best scores for partitions with 2 clusters in most cases.
In the Zoo and Glass datasets, where the real number of clusters is 7, only one of the indexes gives its best score at k=7.

If it's important to mention:

  • K-Means (Trickl-Cluster's Implementation) results are identical to Weka's output (tested on iris dataset).
  • The calculation of the centroids (means of the clusters) is almost identical to Weka's output.
  • The API used for matrix calculations is Colt (computing norms, matrix operations, distances between cluster centroids…).

What's wrong?

Best Answer

Class labels aren't the same as clusters.

If you look at, e.g., the iris data set, it's fairly obvious that the best solution will have just 2 clusters, not three. Plot the unlabeled data and ask some people how many groups they see in it. If you set k to three, you will often get results where the wrong cluster is split.

The problem is that there is a mismatch between these measures (which measure some mathematical properties) and reality. In reality, classes may consist of multiple clusters, and classes may cluster themselves. Your data may just lack the information to clearly show the structure that someone manually annotated.

Plus, preprocessing is essential. Preprocess your data differently, and your clustering algorithms will produce substantially different results, and the scores you compute will usually be quite different as well (at least for any index that is distance-based!).
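To illustrate the preprocessing point, here is a minimal sketch (not taken from any library) that z-scores each attribute and compares Euclidean distances before and after. The class name `ZScoreDemo`, the helper names, and the three toy points are made up for illustration; z-scoring is only one of many possible preprocessing choices.

```java
public final class ZScoreDemo {

    /** Returns a copy of x with every column shifted to mean 0 and scaled to population std 1. */
    public static double[][] standardize(double[][] x) {
        int n = x.length, dim = x[0].length;
        double[][] z = new double[n][dim];
        for (int d = 0; d < dim; d++) {
            double mean = 0;
            for (double[] row : x) mean += row[d];
            mean /= n;
            double var = 0;
            for (double[] row : x) var += (row[d] - mean) * (row[d] - mean);
            double std = Math.sqrt(var / n);
            for (int i = 0; i < n; i++) {
                // A constant column carries no information; map it to 0 instead of dividing by 0.
                z[i][d] = std == 0 ? 0 : (x[i][d] - mean) / std;
            }
        }
        return z;
    }

    /** Plain Euclidean distance between two points of equal dimension. */
    public static double dist(double[] p, double[] q) {
        double s = 0;
        for (int d = 0; d < p.length; d++) s += (p[d] - q[d]) * (p[d] - q[d]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        // Hypothetical toy data: the second attribute has a much larger scale than the first.
        double[][] x = {{1, 1000}, {2, 1000}, {1, 3000}};
        double[][] z = standardize(x);
        System.out.println("raw:    " + dist(x[0], x[1]) + " vs " + dist(x[0], x[2]));
        System.out.println("scaled: " + dist(z[0], z[1]) + " vs " + dist(z[0], z[2]));
    }
}
```

On the raw data the first point is 1 away from the second and 2000 away from the third; after scaling, both neighbours are equally far. K-Means and every distance-based index see a different geometry in the two cases.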

If you want to compare a clustering result with existing class labels, it's best to use an external evaluation measure instead of an internal one.
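As one example of an external measure, the (unadjusted) Rand index counts the fraction of point pairs on which the class labels and the cluster assignment agree. The sketch below is a plain O(n²) illustration; the class name `RandIndex` and the toy label arrays are hypothetical, and in practice a chance-corrected variant is often preferred.

```java
public final class RandIndex {

    /**
     * Fraction of point pairs treated the same way by both labelings:
     * together in both, or separated in both. Label values themselves do not matter.
     */
    public static double randIndex(int[] classes, int[] clusters) {
        if (classes.length != clusters.length || classes.length < 2) {
            throw new IllegalArgumentException("need two labelings of equal length >= 2");
        }
        long agree = 0, pairs = 0;
        for (int i = 0; i < classes.length; i++) {
            for (int j = i + 1; j < classes.length; j++) {
                boolean sameClass = classes[i] == classes[j];
                boolean sameCluster = clusters[i] == clusters[j];
                if (sameClass == sameCluster) agree++;
                pairs++;
            }
        }
        return (double) agree / pairs;
    }

    public static void main(String[] args) {
        // Hypothetical example: three classes, but the clustering merges two of them.
        int[] classes  = {0, 0, 1, 1, 2, 2};
        int[] clusters = {1, 1, 0, 0, 0, 0};
        System.out.println(randIndex(classes, clusters)); // 11 of 15 pairs agree
    }
}
```

Because only pair membership is compared, the cluster ids do not have to match the class ids, which is what makes this kind of measure suitable for comparing K-Means output with annotated labels.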

Have you tried computing your measures on the "true" clusters? I wouldn't be surprised if most of the time, the results produced by the clustering algorithms score better on each of these measures!
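A sketch of how such a check could look for Davies-Bouldin: the index is written as a function of the data and an arbitrary label array, so the same code can score the annotated classes and the K-Means output. It assumes the common definition with Euclidean distance and mean distance to the centroid as scatter (lower is better); the class name `DaviesBouldin` and the toy data are hypothetical. The toy points are chosen so the value can be checked by hand, which also helps with verifying an implementation.

```java
public final class DaviesBouldin {

    /** Davies-Bouldin index for labels in 0..k-1; lower values mean compact, well-separated clusters. */
    public static double index(double[][] x, int[] labels, int k) {
        int dim = x[0].length;
        double[][] centroid = new double[k][dim];
        int[] size = new int[k];
        for (int i = 0; i < x.length; i++) {
            size[labels[i]]++;
            for (int d = 0; d < dim; d++) centroid[labels[i]][d] += x[i][d];
        }
        for (int c = 0; c < k; c++) {
            if (size[c] == 0) throw new IllegalArgumentException("empty cluster " + c);
            for (int d = 0; d < dim; d++) centroid[c][d] /= size[c];
        }
        // Scatter of a cluster: mean distance of its members to its centroid.
        double[] scatter = new double[k];
        for (int i = 0; i < x.length; i++) scatter[labels[i]] += dist(x[i], centroid[labels[i]]);
        for (int c = 0; c < k; c++) scatter[c] /= size[c];
        // For each cluster take its worst (largest) similarity ratio to any other cluster.
        double sum = 0;
        for (int a = 0; a < k; a++) {
            double worst = 0;
            for (int b = 0; b < k; b++) {
                if (a == b) continue;
                worst = Math.max(worst, (scatter[a] + scatter[b]) / dist(centroid[a], centroid[b]));
            }
            sum += worst;
        }
        return sum / k;
    }

    static double dist(double[] p, double[] q) {
        double s = 0;
        for (int d = 0; d < p.length; d++) s += (p[d] - q[d]) * (p[d] - q[d]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] x = {{0, 0}, {2, 0}, {10, 0}, {12, 0}};
        // Scatter 1 and 1, centroids 10 apart: (1 + 1) / 10 = 0.2
        System.out.println(index(x, new int[] {0, 0, 1, 1}, 2));
        // Interleaved labels: scatter 5 and 5, centroids 2 apart: (5 + 5) / 2 = 5.0
        System.out.println(index(x, new int[] {0, 1, 0, 1}, 2));
    }
}
```

Passing the class labels as `labels` (remapped to 0..k-1) gives the score of the "true" partition, which can then be compared with the score of each K-Means result on the same data.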

For clustering with Java I mostly use ELKI. It's really fast, and it has plenty of algorithms, not just the 3 standard algorithms from the 70s that everybody has. But I don't think it currently has internal evaluation (it has some 20 measures for external evaluation, though). I'm sure they would appreciate it if someone contributed such internal evaluation indexes! Maybe if you contribute your code there, they will help you verify the implementation. Maybe they already have some of the measures implemented somewhere, and I just didn't find them.
