I was working on a dataset today: I applied the K-means clustering algorithm and then calculated the silhouette coefficient for each point. I then removed the 5% of the data with the worst silhouette coefficients and re-clustered. The average silhouette coefficient for each cluster got worse after removing those outliers and re-clustering (I expected the opposite). Am I getting the right answer, or have I made a blatant mistake? The results don't seem intuitive to me.
Solved – Silhouette coefficients after deleting some data and re-clustering
clustering, k-means, outliers, self-study
Related Solutions
The ASW is a measure of the coherence of a clustering solution. A high ASW value means that the clusters are homogeneous (all observations are close to their cluster center) and well separated. According to Kaufman and Rousseeuw (1990), a value below 0.25 means that the data are not structured. Between 0.25 and 0.5, the data might be structured, but the structure might also be an artifact. Keep in mind that these values are indicative and should not be used as decision thresholds; they are not theoretically derived (they are not based on some p-value) but come from the authors' experience. Hence, according to these low ASW values, your data seem to be quite unstructured. If the purpose of the cluster analysis is only descriptive, you can argue that it reveals some (but only some) of the most salient patterns. However, I think that in your case you should not draw any theoretical conclusions from your clustering.
You can also have a look at the "per cluster" ASW values (given by the function wcClusterQuality). Maybe some of your clusters are well defined while others are "spurious" (ASW < 0), resulting in a low overall ASW value.
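If you work in Python rather than R, a rough equivalent of these per-cluster ASW values can be computed with scikit-learn's silhouette_samples. This is a sketch on synthetic data, not the WeightedCluster workflow:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic data standing in for the real dataset
X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

# One silhouette coefficient per point; average within each cluster
sil = silhouette_samples(X, labels)
for k in np.unique(labels):
    print(f"cluster {k}: ASW = {sil[labels == k].mean():.2f}")
```

A cluster whose average comes out near zero or negative is the kind of "spurious" cluster the answer describes.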
You can also try bootstrap strategies, which should give you a better hint. In R, the function clusterboot in the package fpc can be used for this purpose (look at its help page). However, it does not work with weighted data; if your data are unweighted, I think it is worth giving it a try.
Finally, you may want to have a closer look at your data and your categorization. Maybe your categories are too unstable or not well defined. However, that does not seem to be the case here.
As you said, "lack of clearly differentiated clusters is not the same thing as a lack of interesting variation". There are other methods to analyse the variability of your sequences, such as discrepancy analysis. These methods let you study the links between sequences and explanatory factors. You may, for instance, try building sequence regression trees (function seqtree in the package TraMineR).
K-means is very sensitive to feature scaling, so it's hard to get mathematically meaningful results when your axes have different units!
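A minimal Python sketch of this effect, on made-up two-feature data (the income/age labels are purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on wildly different scales: income in dollars, age in years
X = np.column_stack([
    rng.normal(50_000, 15_000, 200),  # income
    rng.normal(40, 12, 200),          # age
])

raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
scaled = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    StandardScaler().fit_transform(X)
)

# On the raw data the income axis dominates the distance computation:
# the centroids differ by tens of thousands in income but barely in age.
print(raw.cluster_centers_)
print(scaled.cluster_centers_)
```

After standardization, both features contribute comparably to the Euclidean distances K-means minimizes.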
Furthermore, the PCA plot is likely misleading; it is probably showing an artifact of your data. There is probably some attribute with very high variance and just three levels (e.g. income of 1000, 10000, 100000) that unduly dominates the visualization; the other attributes then just add a "Gaussian blur" to it. So all you are doing is reverse-engineering an attribute that is already in your data. My guess is that you can identify it simply by looking for the highest-variance attribute.
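Checking for such a dominant attribute is a one-liner. The sketch below builds hypothetical data of exactly the kind described (one three-level, high-variance column plus noise features) and finds the culprit by variance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: a three-level, high-variance attribute plus noise features
income = rng.choice([1_000, 10_000, 100_000], size=300)
noise = rng.normal(0, 1, size=(300, 4))
X = np.column_stack([income, noise])

print("per-feature variance:", X.var(axis=0).round(1))
print("dominant feature index:", X.var(axis=0).argmax())
```

The income column's variance is larger than the noise features' by many orders of magnitude, which is exactly why it would dominate an unscaled PCA.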
Best Answer
That's a good question. The silhouette coefficient of an object shows how strongly the decision to assign the object to its actual cluster is justified over the decision to assign it to the closest other cluster. A value tending to 1 indicates high justification (a well-clustered object). A negative value indicates that the object would better belong to that other cluster. A value close to zero is characteristic of a "borderline" object lying between the two clusters.
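For reference, the silhouette coefficient of object $i$ as defined by Kaufman and Rousseeuw is:

```latex
% a(i): mean distance from object i to the other members of its own cluster
% b(i): smallest mean distance from i to the members of any other cluster
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```

so $s(i) \in [-1, 1]$, with $s(i) \approx 0$ exactly when $a(i) \approx b(i)$, i.e. the borderline case described above.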
In real data, even an optimal clustering will leave some objects with low positive values, because neighbouring clusters usually "touch" each other at their borders. Unless the value is negative, there is no reason to reassign the object (though you may do it, and sometimes it will improve the clusters). Nor is there reason to delete objects with low positive values. Deleting borderline objects may not help: re-clustering after the deletion will redefine the clusters and make other points borderline in place of the deleted ones, so you are not guaranteed to improve the overall cluster solution. In addition, deletion is a gross intervention in real data, and you must have a strong reason to treat intermediate points as outliers.
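The procedure from the question can be sketched with scikit-learn on synthetic, deliberately overlapping blobs. Note that the "after" score need not be higher than the "before" score, precisely because re-clustering redefines the cluster borders:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Overlapping clusters, so that many points are borderline
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=2.0, random_state=2)

labels = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(X)
sil = silhouette_samples(X, labels)

# Drop the 5% of points with the worst silhouette coefficients
keep = np.argsort(sil)[int(0.05 * len(X)):]
X_trim = X[keep]
labels_trim = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(X_trim)

before = silhouette_score(X, labels)
after = silhouette_score(X_trim, labels_trim)
print(f"before: {before:.3f}")
print(f"after:  {after:.3f}")
```

Running this with different random seeds shows the change in score going either way, which matches the answer's point: deleting borderline points promotes other points to the border.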
Also, you should take into consideration that the original silhouette index which you probably use (Kaufman, L., Rousseeuw, P. Finding Groups in Data: An Introduction to Cluster Analysis. New York, 1990) is based on averaged pairwise distances, whereas K-means clustering tries to minimize deviations from cluster centres. Thus, that index is not a very good judge for K-means. One could redefine the terms of the silhouette formula in terms of deviations from the centres; the index is then more appropriate for K-means (if you use SPSS, you may find a program to compute such a modified silhouette on my web page).
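One common centroid-based variant along these lines is the "simplified silhouette", which replaces the averaged pairwise distances with distances to centroids. The sketch below is that variant in Python, not the SPSS program the answer mentions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

X, _ = make_blobs(n_samples=300, centers=3, random_state=3)
km = KMeans(n_clusters=3, n_init=10, random_state=3).fit(X)
labels, centers = km.labels_, km.cluster_centers_

d = pairwise_distances(X, centers)        # each point's distance to every centroid
a = d[np.arange(len(X)), labels]          # a(i): distance to own centroid
d_other = d.copy()
d_other[np.arange(len(X)), labels] = np.inf
b = d_other.min(axis=1)                   # b(i): distance to nearest other centroid

s = (b - a) / np.maximum(a, b)            # simplified silhouette per point
print(f"mean simplified silhouette: {s.mean():.2f}")
```

Because it uses the same within-cluster deviations that K-means minimizes, this variant judges a K-means solution on its own terms, and it is also much cheaper to compute (O(nk) rather than O(n²) distances).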
The dependency of K-means on the choice of initial cluster centres should also be remembered here, as @user603 points out in their comment.