R – Fuzzy K-means: Understanding Cluster Sizes and Their Implications

clustering, fuzzy, k-means, r

I'm trying to do fuzzy k-means clustering on a dataset using the cmeans function in R. The problem I'm facing is that the cluster sizes are not what I would like them to be. The sizes are obtained by assigning each observation to the cluster to which it is "closest".

cl$size
 [1]   108    31   192    51   722 18460    67  1584   419 17270

Here we see that for 10 clusters we have two huge clusters and a lot of very small ones. Does this imply that two clusters are optimal in any way? If I run regular k-means, the 10 segments look very good, with reasonable sizes, and their interpretation makes a lot of sense, but I would like to apply the fuzzy approach correctly. I have just started exploring fuzzy clustering, so any help and pointers are very welcome.

Best Answer

K-means and also fuzzy k-means (emphasized by your "winner takes all" strategy) assume that clusters have the same spatial extent.

This is best explained by looking at an object $o$ almost half-way between cluster centers $c_i$ and $c_j$. If it is slightly closer to $c_i$, it will go into cluster $i$; if it is slightly closer to $c_j$, it will go into cluster $j$. That is, k-means assumes that the proper split is the hyperplane through the midpoint of the two centers and orthogonal to the line connecting them (i.e. the Voronoi cell boundary), and it does not take into account at all that clusters may have different spatial extents.
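To make the midpoint argument concrete, here is a minimal base R sketch. The centers, spreads, and object position are made-up values, and `fcm_membership` is a hypothetical helper that applies the standard fuzzy c-means membership formula with fuzzifier $m = 2$; it is not part of any package.

```r
# Two hypothetical 1-D cluster centers with very different spreads.
centers <- c(wide = 0, narrow = 10)
spread  <- c(wide = 3, narrow = 0.5)   # assumed standard deviations

o <- 5.2                               # object just past the midpoint (5)

# Fuzzy c-means membership of x in each cluster:
# u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1)), with d_i the distance to center i
fcm_membership <- function(x, centers, m = 2) {
  d <- abs(x - centers)
  sapply(d, function(di) 1 / sum((di / d)^(2 / (m - 1))))
}

u    <- fcm_membership(o, centers)     # memberships, sum to 1
hard <- names(which.max(u))            # "winner takes all" assignment

# Distance to each center measured in units of that cluster's own spread
z <- abs(o - centers) / spread

print(u)
print(hard)
print(z)
```

The object is assigned to the narrow cluster because it is marginally closer to that center, even though, measured in units of each cluster's spread, it lies far further from the narrow cluster than from the wide one. The membership computation never sees `spread`, which is exactly the limitation described above.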

EM clustering (Wikipedia) with Gaussian mixtures is essentially an extension of fuzzy k-means that a) does not assume all dimensions are equally important and b) allows clusters to have different spatial extents. You could simplify it by dropping the full Gaussian model (or at least the covariances) and keeping just a weight per cluster. This weight would essentially give the relative cluster size.
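The simplified variant with only a cluster weight could look roughly like the following base R sketch. It is an illustration under stated assumptions, not a reference implementation: the data are simulated, the standard deviation `s` is fixed and shared by all clusters (so no covariances are estimated), and the initialization is deliberately crude.

```r
set.seed(1)
# Hypothetical 1-D data: one large and one small group
x <- c(rnorm(900, mean = 0), rnorm(100, mean = 4))

k  <- 2
mu <- c(min(x), max(x))   # crude initial centers
w  <- rep(1 / k, k)       # cluster weights (mixing proportions)
s  <- 1                   # shared, fixed standard deviation (an assumption)

for (iter in 1:100) {
  # E-step: soft assignment proportional to weight * Gaussian density
  dens <- sapply(1:k, function(j) w[j] * dnorm(x, mean = mu[j], sd = s))
  resp <- dens / rowSums(dens)

  # M-step: update weights and centers from the soft assignments
  w  <- colMeans(resp)
  mu <- colSums(resp * x) / colSums(resp)
}

print(round(w, 2))    # estimated relative cluster sizes
print(round(mu, 2))   # estimated centers
```

The fitted weights `w` estimate the relative cluster sizes directly, and because they enter the soft assignment, a large cluster can claim objects near the boundary that a plain distance comparison would give to its smaller neighbor.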

Or you might want to look into more advanced methods for arbitrarily shaped clusters, such as DBSCAN (Wikipedia) and OPTICS (Wikipedia).
