Clustering – Understanding Why the K-Means Algorithm Does Not Always Find the Global Minimum

clustering, convergence, extreme value, gradient descent, k-means

I read that the k-means algorithm only converges to a local minimum and not to the global minimum. Why is this? I can see intuitively how initialization could affect the final clustering, and that sub-optimal clusterings are possible, but I have not found anything that proves this mathematically.

Also, why is k-means an iterative process? Can't we just partially differentiate the objective function with respect to the centroids and equate it to zero to find the centroids that minimize the function? Why do we have to use gradient descent to reach the minimum step by step?

Best Answer

You can see k-means as a special version of the EM algorithm, which may help a little.

Say you are estimating a multivariate normal distribution for each cluster, with the covariance matrix fixed to the identity matrix for all of them but a variable mean $\mu_i$, where $i$ is the cluster's index. Clearly, if the parameters $\{\mu_i\}$ are known, you can assign each point $p$ its maximum likelihood cluster (i.e. the $\mu_i$ for which the distance to $p$ is minimal). The EM algorithm for this problem is almost equivalent to k-means.
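
To make the "maximum likelihood cluster" step explicit (a standard computation, not spelled out in the answer above): with the covariance fixed to the identity, the log-density of a point $p$ under cluster $i$ is, up to constants, just the negative squared distance to $\mu_i$, so maximizing the likelihood over clusters is the same as minimizing the distance:

$$\log \mathcal{N}(p \mid \mu_i, I) = -\frac{1}{2}\lVert p - \mu_i \rVert^2 - \frac{d}{2}\log(2\pi), \qquad \arg\max_i \, \log \mathcal{N}(p \mid \mu_i, I) = \arg\min_i \, \lVert p - \mu_i \rVert.$$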

The other way around: if you know which points belong to which cluster, you can estimate the optimal $\mu_i$. The closed-form solution to the full problem (the one that finds a global optimum) says that to find the maximum likelihood models $\{\hat\mu_i\}$ you must sum over all possible assignments of points to clusters. Since even with just thirty points and two clusters there are about a billion such possible assignments ($2^{30} \approx 10^9$), this is infeasible to compute.
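
To make the combinatorial blow-up concrete, here is a minimal brute-force sketch (mine, not from the answer) that finds the true global optimum by enumerating every assignment; the toy data and $k = 2$ are purely illustrative:

```python
# Illustrative sketch: brute-force global optimum by enumerating every
# assignment of n points to k = 2 clusters. With n = 30 this would be
# 2^30 (about a billion) assignments; we keep n tiny so it runs instantly.
from itertools import product
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))  # n = 10 points -> 2^10 = 1024 assignments

best_cost, best_labels = np.inf, None
for assignment in product([0, 1], repeat=len(X)):
    labels = np.array(assignment)
    if labels.min() == labels.max():   # skip the two empty-cluster assignments
        continue
    # For a fixed assignment the optimal centroid is the cluster mean,
    # and the cost is the within-cluster sum of squared distances.
    cost = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in (0, 1))
    if cost < best_cost:
        best_cost, best_labels = cost, labels

print("global optimum cost:", best_cost)
```

Every additional point doubles the number of assignments to check, which is exactly why the exact global solution is out of reach for realistic data.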

Instead, we can take a guess at the hidden parameters (or the model parameters) and iterate the two steps, with the possibility of ending up in a local maximum. If you allow each cluster to take partial responsibility for a point, you end up with EM; if you just assign each point to the optimal cluster, you get k-means.
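
As a sketch of what "iterating the two steps" looks like with hard assignments (an illustrative implementation of mine, under the simplifying assumption that no cluster ever becomes empty):

```python
# Illustrative sketch: hard-assignment iteration (k-means / Lloyd's algorithm).
# Replacing the argmin with soft "responsibilities" (posterior cluster
# probabilities) would turn this loop into EM for the identity-covariance
# Gaussian mixture described above.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # initial guess
    for _ in range(n_iter):
        # E-like step: hard-assign each point to its nearest mean
        # (the maximum likelihood cluster under identity covariance).
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # M-like step: for a fixed assignment the optimal mean is the
        # cluster average (the zero of the partial derivative).
        new_mu = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_mu, mu):
            break                 # converged to a (possibly local) minimum
        mu = new_mu
    return mu, labels
```

Each step only ever decreases the objective given the other variable held fixed, which is why the iteration converges, but only to a fixed point determined by the initial guess.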

So, the executive summary: in probabilistic terms there is a global solution, but it requires you to enumerate all possible clusterings. Clearly, the same is true if you have an objective function: you could iterate over all solutions and evaluate the objective for each, but the number of solutions is exponential in the size of your data.
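
As a quick empirical check (again a sketch of mine, assuming scikit-learn is available; the data and seeds are arbitrary), running k-means from several random initializations and comparing the final objective values shows the dependence on initialization directly:

```python
# Illustrative sketch: k-means started from different random initializations
# can converge to different local minima of the within-cluster sum of squares.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs; a bad initialization can still merge two of them.
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in [(0, 0), (5, 0), (2.5, 4)]])

for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  final objective (inertia) = {km.inertia_:.2f}")
# Different seeds typically print different objective values, i.e. the runs
# stopped at different local minima; this is why practitioners run k-means
# with many restarts and keep the best result.
```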
