Solved – K-means for non-spherical (non-globular) clusters

k-means

It is said that K-means clustering "does not work well with non-globular clusters."

However, is this a hard-and-fast rule – or is it that it does not often work?

I have a 2-d data set (specifically depth of coverage and breadth of coverage of genome sequencing reads across different genomic regions cf. bioinformatics). In short, I am expecting two clear groups from this dataset (with notably different depth of coverage and breadth of coverage) and by defining the two groups I can avoid having to make an arbitrary cut-off between them.

The procedure appears to successfully identify the two expected groupings, however the clusters are clearly not globular. Is this a valid application? I am not sure whether I am violating any assumptions (if there are any?), or whether it is just that k-means often does not work with non-spherical data clusters.

If the question being asked is, is there a depth and breadth of coverage associated with each group which means the data can be partitioned such that the means of the members of the groups are closer for the two parameters to members within the same group than between groups, then the answer appears to be yes. But is it valid? Or is it simply, if it works, then it's ok?

(Apologies, I am very much a stats novice.)

Edit: below is a visual of the clusters. The breadth of coverage is 0 to 100 % of the region being considered. The depth is 0 to infinity (I have log transformed this parameter as some regions of the genome are repetitive, so reads from other areas of the genome may map to it resulting in very high depth – again, please correct me if this is not the way to go in a statistical sense prior to clustering).

Clusters (breadth/depth of coverage

As you can see the red cluster is now reasonably compact thanks to the log transform, however the yellow (gold?) cluster is not. Despite this, without going into detail the two groups make biological sense (both given their resulting members and the fact that you would expect two distinct groups prior to the test), so given that the result of clustering maximizes the between group variance, surely this is the best place to make the cut-off between those tending towards zero coverage (will never be exactly zero due to incorrect mapping of reads) and those with distinctly higher breadth/depth of coverage. The algorithm converges very quickly <10 iterations.

Thanks,
Jon.

Best Answer

K-means will not perform well when groups are grossly non-spherical. I highly recomend this answer by David Robinson to get a better intuitive understanding of this and the other assumptions of k-means.

K-means does not perform well when the groups are grossly non-spherical because k-means will tend to pick spherical groups. Tends is the key word and if the non-spherical results look fine to you and make sense then it looks like the clustering algorithm did a good job. What matters most with any method you chose is that it works.

Related Question