Solved – Silhouette coefficients after deleting some data and re-clustering

clustering, k-means, outliers, self-study

I was working on a dataset today on which I ran the K-means clustering algorithm and then calculated the silhouette coefficient for each point. I then removed the 5% of the data with the worst silhouette coefficients and re-clustered. The average silhouette coefficient for each cluster got worse after removing those outliers and re-clustering (I expected the opposite). Am I getting the right answer, or have I made a blatant mistake? The results don't seem intuitive to me.

Best Answer

That's a good question. The value of the Silhouette index for an object shows how strongly justified the decision to assign the object to its actual cluster is, compared with assigning it to the other cluster closest to it. A value tending to 1 indicates high justification (a well-clustered object). A negative value indicates that the object would fit better in that other cluster. A value close to zero is characteristic of a "borderline" object lying between the two clusters.
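For reference, the usual definition from Kaufman and Rousseeuw can be written as follows. The symbols are my own notation, not the question's: d(i, j) is the distance between objects i and j, C_I is the cluster containing i, and C_J ranges over the other clusters.

```latex
a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j),
\qquad
b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j),
\qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```

By convention s(i) = 0 when i is the only member of its cluster. With b(i) much larger than a(i) the value approaches 1, with a(i) = b(i) it is 0, and with a(i) > b(i) it is negative, which matches the three cases described above.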

In real data, even an optimal clustering will leave some objects with low positive values, because neighbouring clusters usually "touch" each other at their borders. Unless the value is negative, there is no reason to reassign the object (though you may do it, and sometimes it will improve the clusters). Nor is there a reason to delete objects with low positive values. Deleting borderline objects may not help: re-clustering after the deletion will redefine the clusters and make other points borderline in place of the deleted ones, so you are not guaranteed to improve the overall cluster solution. In addition, deleting is a gross intervention in real data, and you must have a strong reason to treat intermediate points as outliers.
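If you want to see this for yourself, here is a rough sketch of the trim-and-recluster experiment. It assumes scikit-learn and uses synthetic blobs in place of your data; the number of clusters, the blob spread, the 0.1 "borderline" threshold and the helper name `cluster_and_score` are all arbitrary choices of mine.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic, overlapping blobs (an assumption; the question's data is not available).
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=2.0, random_state=0)


def cluster_and_score(data, k=3, seed=0):
    """Fit K-means and return (labels, per-point silhouette values)."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)
    return labels, silhouette_samples(data, labels)


labels_before, sil_before = cluster_and_score(X)

# Drop the 5% of points with the lowest silhouette values.
n_drop = int(round(0.05 * len(X)))
keep = np.argsort(sil_before)[n_drop:]
X_trimmed = X[keep]

# Re-cluster the remaining points from scratch and score them again.
labels_after, sil_after = cluster_and_score(X_trimmed)

print("mean silhouette before:", sil_before.mean())
print("mean silhouette after: ", sil_after.mean())
print("points with s < 0.1 before:", int((sil_before < 0.1).sum()))
print("points with s < 0.1 after: ", int((sil_after < 0.1).sum()))
```

Whether the mean goes up or down will depend on the data and on where the new cluster borders fall, so treat the printed numbers as something to inspect rather than as a result. The thing to look for is whether low-valued points reappear after trimming.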

Also, you should take into account that the original Silhouette index, which you are probably using (Kaufman, L., Rousseeuw, P. Finding groups in data: an introduction to cluster analysis. New York, 1990), is based on averaged pairwise distances, whereas K-means clustering tries to minimize deviations from the cluster centres. Thus, that index is not a very good judge of K-means solutions. One can redefine the terms of the Silhouette index formula in terms of deviations from centres; the index is then more appropriate for K-means (if you use SPSS, you may find a program to compute such a modified Silhouette on my web page).
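One possible way to make that redefinition concrete, sketched in Python with NumPy, is to take a as the distance from a point to its own cluster centre and b as the distance to the nearest other centre. This is only an illustration of the idea and is not the SPSS program mentioned above; the function name `centroid_silhouette` is made up for this example.

```python
import numpy as np


def centroid_silhouette(X, labels):
    """Centre-based silhouette: a = distance to own centre, b = distance to nearest other centre.

    Assumes at least two distinct clusters are present in `labels`.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    centres = np.array([X[labels == c].mean(axis=0) for c in cluster_ids])

    # Euclidean distance from every point to every centre, shape (n_points, n_clusters).
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)

    rows = np.arange(len(X))
    own = np.searchsorted(cluster_ids, labels)  # column of each point's own cluster
    a = dists[rows, own]

    # Mask out the own-cluster column, then take the nearest remaining centre.
    others = dists.copy()
    others[rows, own] = np.inf
    b = others.min(axis=1)

    denom = np.maximum(a, b)
    safe = np.where(denom > 0, denom, 1.0)  # avoid 0/0 when a point sits on coincident centres
    return np.where(denom > 0, (b - a) / safe, 0.0)


# Tiny usage example: two tight, well-separated clusters.
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels_demo = np.array([0, 0, 1, 1])
s_demo = centroid_silhouette(X_demo, labels_demo)
print(s_demo)
```

Because it measures deviation from centres, this variant looks at the same quantity that K-means optimizes, and it avoids computing all pairwise distances.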

The dependence of K-means on the choice of initial cluster centres should also be kept in mind here, as @user603 points out in their comment.
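A quick way to check how much the initial centres matter on a given dataset is to run K-means several times with a single random initialization each and compare the solutions. The sketch below assumes scikit-learn and synthetic data; the seeds, cluster count and blob spread are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, overlapping blobs (an assumption, standing in for real data).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.5, random_state=1)

results = []
for seed in range(5):
    # n_init=1 with init="random": each run keeps whatever a single random start converges to.
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    results.append((seed, km.inertia_, silhouette_score(X, km.labels_)))

for seed, inertia, sil in results:
    print(f"seed={seed}  inertia={inertia:.1f}  mean silhouette={sil:.3f}")
```

If the inertia or the mean silhouette varies noticeably between seeds, part of the change seen after deleting points and re-clustering may simply come from a different starting configuration. Using several restarts (a larger `n_init`) and the same seed before and after the deletion helps reduce that effect.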
