Modified K-means with unequal cluster variances

k-means

I wonder how I can modify the K-means algorithm so that the cluster volumes are not forced to be equal. The K-means objective is to minimize the within-cluster sum of squares $\sum_{i=1}^{p} {\parallel \mathit{X}_i-\mathit{L}_{\mathit{Z}_i} \parallel}_2^2$, and this objective assumes that all cluster variances are the same. If we assume that the clusters are Gaussian with mean $\mathit{L}_{\mathit{Z}_i}$ and variance $\sigma_{\mathit{Z}_i}^2$, where $\mathit{Z}_i$ stands for the cluster assignment of data point $i$, then the objective for the cluster assignments becomes $\sum_{i=1}^{p} \frac {{\parallel \mathit{X}_i-\mathit{L}_{\mathit{Z}_i} \parallel}_2^2} {\sigma_{\mathit{Z}_i}^2}$. So I tried modifying K-means such that the $\mathit{Z}_i$ update uses this new rule and the $\sigma_{\mathit{Z}_i}^2$ are also re-estimated in each iteration. However, when I run this modified K-means, almost all data points end up assigned to the same cluster, which is weird. What might be the problem with this approach? I know EM can be used to fit an unequal-volume GMM, but I want a simpler approach like K-means, and I am really curious about why what I tried does not work. Thanks!

Best Answer

Use Gaussian Mixture Modeling (also known as EM clustering) instead.
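As a rough sketch of why the modified objective in the question tends to collapse: under the question's own assumption of spherical Gaussian clusters, the per-point negative log-likelihood has two terms, not one. Here $d$ (the dimension of $\mathit{X}_i$) and $I_d$ (the $d \times d$ identity) are notation assumed for this illustration; they do not appear in the question.

$$
-\log \mathcal{N}\!\left(\mathit{X}_i \mid \mathit{L}_{\mathit{Z}_i},\, \sigma_{\mathit{Z}_i}^2 I_d\right)
= \frac{{\parallel \mathit{X}_i-\mathit{L}_{\mathit{Z}_i} \parallel}_2^2}{2\,\sigma_{\mathit{Z}_i}^2}
+ \frac{d}{2}\,\log\!\left(2\pi\,\sigma_{\mathit{Z}_i}^2\right)
$$

The objective in the question keeps only the first term (up to the factor $\tfrac{1}{2}$). Without the $\log$ term there is no penalty for a large $\sigma_{\mathit{Z}_i}^2$, so a cluster whose variance happens to be larger looks cheaper for most points, absorbs them, and its variance grows further. This is likely what drives nearly all points into one cluster. A Gaussian mixture model keeps the $\log$ term.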

It allows each cluster to have its own variance, depending on the model you choose. The most general models even allow a separate covariance matrix per cluster.
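A minimal sketch of this suggestion, assuming scikit-learn's `GaussianMixture` is an acceptable implementation (the answer does not name a library). The helper `fit_unequal_variance_gmm` and the synthetic two-cluster data are hypothetical and exist only for illustration. Setting `covariance_type="spherical"` gives one variance per cluster, which matches the question's setup; `"full"` would give each cluster its own covariance matrix.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_unequal_variance_gmm(X, n_clusters, seed=0):
    """Fit a GMM with one scalar variance per cluster (hypothetical helper)."""
    gmm = GaussianMixture(
        n_components=n_clusters,
        covariance_type="spherical",  # one sigma^2 per cluster
        n_init=5,                     # several restarts; EM is sensitive to initialization
        random_state=seed,
    )
    gmm.fit(X)
    return gmm


# Synthetic data (an assumption for illustration): one tight and one wide cluster.
rng = np.random.default_rng(0)
tight = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(300, 2))
wide = rng.normal(loc=[6.0, 0.0], scale=2.0, size=(300, 2))
X = np.vstack([tight, wide])

gmm = fit_unequal_variance_gmm(X, n_clusters=2)
labels = gmm.predict(X)        # hard assignments, comparable to the K-means Z_i
variances = gmm.covariances_   # shape (n_clusters,): estimated sigma^2 per cluster

print("cluster sizes:", np.bincount(labels))
print("estimated variances:", variances)
```

If hard assignments are all that is needed, `predict` plays the role of the K-means assignment step, while `predict_proba` exposes the soft responsibilities that EM uses internally.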