Solved – K-means as a limit case of EM algorithm for Gaussian mixtures with covariances $\epsilon^2 I$ going to $0$

expectation-maximization, gaussian mixture distribution, k-means, self-study

My goal is to see that the K-means algorithm is in fact the Expectation-Maximization algorithm for Gaussian mixtures in which all components have covariance $\sigma^2 I$, in the limit $\sigma \to 0$.

Suppose we have a data set $\{x_1, \dots ,x_N\}$ of observations of random variable $X$.
The objective function for K-means is given by:
$$J = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk} ||x_n - \mu_k ||^2$$
where $r_{nk}$ is a binary indicator variable for the hard assignment of $x_n$ to cluster $k$
(if data point $x_n$ is assigned to cluster $k$, then $r_{nk} = 1$ and $r_{nj} = 0$ for $j \ne k$).
The K-means algorithm minimizes $J$ by iterating, until convergence, two successive steps (sketched in code below):
(E) minimize $J$ with respect to $\{r_{nk}\}_{n,k}$ keeping all $\mu_k$ fixed
(M) minimize $J$ with respect to $\{\mu_k\}_k$ keeping all $r_{nk}$ fixed
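For concreteness, here is a minimal NumPy sketch of one such iteration (the array names `X`, `mu`, `r` and the handling of empty clusters are my own choices, not part of the derivation):

```python
import numpy as np

def kmeans_step(X, mu):
    """One K-means iteration: data X is (N, D), current means mu are (K, D)."""
    # (E) minimize J over r with mu fixed: hard-assign each x_n to its nearest mean
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    r = np.zeros_like(d2)
    r[np.arange(X.shape[0]), d2.argmin(axis=1)] = 1.0

    # (M) minimize J over mu with r fixed: each mean is the average of its assigned points
    counts = r.sum(axis=0)
    mu_new = (r.T @ X) / np.maximum(counts, 1.0)[:, None]
    mu_new[counts == 0] = mu[counts == 0]                       # keep empty clusters where they are

    J = (r * d2).sum()                                          # objective value after the E-step
    return r, mu_new, J
```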

In general, denoting all observed data by $X$, all latent variables by $Z$ and the set of all model parameters by $\theta$, the EM algorithm maximizes the likelihood $p(X \mid \theta)$ (or, if a prior is included, the posterior $p(\theta \mid X)$) by iterating, until convergence, two alternating steps (a generic skeleton is sketched after these two steps):
(E) calculate the expectation $Q(\theta, \theta^{\text{old}}) := \sum_{Z}p(Z | X, \theta^{\text{old}})\log p(Z,X|\theta)$
(M) find $\theta^{\text{new}} = \arg \max_{\theta} Q(\theta, \theta^{\text{old}})$
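Viewed as code, the general scheme is just this alternation; the skeleton below only illustrates the structure (the callables `e_step` and `m_step` and the fixed iteration count are placeholders I introduce here):

```python
def em(X, theta0, e_step, m_step, n_iter=100):
    """Generic EM skeleton.

    e_step(X, theta) should return the posterior p(Z | X, theta) in whatever
    form m_step needs to build Q(theta, theta_old); m_step(X, posterior)
    should return argmax over theta of that Q.
    """
    theta = theta0
    for _ in range(n_iter):
        posterior = e_step(X, theta)    # (E) posterior over the latent variables Z
        theta = m_step(X, posterior)    # (M) maximize the expected complete-data log-likelihood
    return theta
```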

Now consider the Gaussian mixture distribution:
$$p(x) = \sum_{k=1}^K \pi_k N(x | \mu_k, \Sigma_k)$$
Introducing a latent $K$-dimensional binary random variable $z$ by $p(z_k = 1) = \pi_k$, we see that:
$$p(X, Z) = \prod_{n=1}^N\prod_{k=1}^K \pi_k^{z_{nk}} N(x_n | \mu_k, \Sigma_k)^{z_{nk}}$$
$$\gamma(z_k) := p(z_k = 1 | x) = \frac{\pi_k N(x| \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x | \mu_j, \Sigma_j)}$$
$$\log p(X,Z | \mu, \Sigma, \pi) = \sum_{n=1}^N \sum_{k=1}^K z_{nk}(\log \pi_k + \log N(x_n| \mu_k, \Sigma_k))$$
$$\mathbb{E}(z_{nk}) = \gamma(z_{nk})$$
So $$Q((\pi, \mu, \Sigma), (\pi, \mu, \Sigma)^{\text{old}}) = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk})(\log \pi_k + \log N(x_n| \mu_k, \Sigma_k))$$
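To make the E-step concrete for the isotropic case considered next (this sketch assumes $\Sigma_k = \sigma^2 I$ and, for simplicity, $\pi_k = 1/K$; the function names are mine), here is a small NumPy illustration that computes the responsibilities $\gamma(z_{nk})$ and evaluates $Q$:

```python
import numpy as np

def responsibilities(X, mu, sigma):
    """gamma[n, k] = p(z_k = 1 | x_n) for Sigma_k = sigma^2 I and pi_k = 1/K."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    log_w = -d2 / (2.0 * sigma**2)                              # log of the unnormalized posterior (pi_k cancels)
    log_w -= log_w.max(axis=1, keepdims=True)                   # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

def Q_value(X, mu, sigma, gamma):
    """Q = sum_{n,k} gamma_{nk} (log pi_k + log N(x_n | mu_k, sigma^2 I))."""
    N, D = X.shape
    K = mu.shape[0]
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    log_pN = np.log(1.0 / K) - 0.5 * D * np.log(2.0 * np.pi * sigma**2) - d2 / (2.0 * sigma**2)
    return (gamma * log_pN).sum()
```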

If now all Gaussians in the mixture model have covariance $\sigma^2 I$, considering the limit $\sigma \to 0$ I can easily show that $\gamma(z_{nk}) \to r_{nk}$ where $r_{nk}$ is as defined above.
So indeed the (E) step updates $r_{nk}$ as in the K-means algorithm.
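A quick numerical illustration (the three points and means below are made up) shows this hardening: as $\sigma$ shrinks, each row of $\gamma$ approaches the one-hot indicator of the nearest mean, i.e. the $r_{nk}$.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.9, 1.1], [5.0, 5.0]])
mu = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])

for sigma in (1.0, 0.3, 0.05):
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    log_w = -d2 / (2.0 * sigma**2)
    log_w -= log_w.max(axis=1, keepdims=True)
    gamma = np.exp(log_w)
    gamma /= gamma.sum(axis=1, keepdims=True)
    print(sigma, np.round(gamma, 3))   # rows tend to one-hot vectors, i.e. the r_nk
```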

However, I have a problem with maximizing $Q((\pi, \mu, \Sigma), (\pi, \mu, \Sigma)^{\text{old}})$ in this context, since for $x \ne \mu$
$\lim_{\sigma \to 0} \log N(x \mid \mu, \sigma^2 I) = -\infty$.
Is it true that, up to some constant and scalar multiplication,
$\lim_{\sigma \to 0} Q((\pi, \mu, \Sigma), (\pi, \mu, \Sigma)^{\text{old}}) = -J$?

Maybe I'm missing something. Any advice?

Best Answer

Is it true that up to some constant and scalar multiplication: $\lim_{\sigma \to 0} Q((\pi, \mu, \Sigma), (\pi, \mu, \Sigma)^{\text{old}}) = -J$?

This is not the case since, as you observed yourself, the limit diverges.

However, if we first transform $Q$ and then take the limit, we converge to the k-means objective. For $\Sigma_k = \sigma^2 I$ and $\pi_k = 1/K$ we have

\begin{align} Q &= \sum_{n,k} \gamma_{nk} \left( \log \pi_k + \log N(x_n \mid \mu_k, \Sigma_k) \right) \\ &= N \log\frac{1}{K} - \frac{1}{2\sigma^2} \sum_{n,k} \gamma_{nk} ||x_n - \mu_k||^2 - N \frac{D}{2} \log 2\pi\sigma^2. \end{align}

Multiplying by $2\sigma^2$ (which does not affect the EM algorithm, since $\sigma$ is not optimized but held constant) and collecting the terms that do not involve $\mu$ in $\sigma^2 C$ (they vanish as $\sigma \to 0$, since $\sigma^2 \log \sigma^2 \to 0$), we see that \begin{align} Q &\propto - \sum_{n,k} \gamma_{nk} ||x_n - \mu_k||^2 + \sigma^2 C. \end{align} Note that maximizing this function with respect to $\mu$ for any $\gamma$ and $\sigma$ gives the same result as maximizing the original $Q$ above, i.e., it is an equivalent formulation of the M-step. But taking the limit now yields $-J$.
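If it helps, the convergence can also be checked numerically: with a toy data set (the values below are my own), $2\sigma^2 Q$ approaches $-J$ computed from the hard assignments as $\sigma \to 0$, since the leftover terms are of order $\sigma^2 \log \sigma^2$.

```python
import numpy as np

X = np.array([[0.0, 0.2], [1.1, 0.9], [5.0, 4.8], [5.2, 5.1]])
mu = np.array([[0.5, 0.5], [5.0, 5.0]])
N, D = X.shape
K = mu.shape[0]

d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
r = np.zeros_like(d2)
r[np.arange(N), d2.argmin(axis=1)] = 1.0
J = (r * d2).sum()                                   # k-means objective for the hard assignments

for sigma in (1.0, 0.1, 0.01):
    log_w = -d2 / (2.0 * sigma**2)
    log_w -= log_w.max(axis=1, keepdims=True)
    gamma = np.exp(log_w)
    gamma /= gamma.sum(axis=1, keepdims=True)
    Q = (gamma * (np.log(1.0 / K) - 0.5 * D * np.log(2.0 * np.pi * sigma**2)
                  - d2 / (2.0 * sigma**2))).sum()
    print(sigma, 2.0 * sigma**2 * Q, -J)             # first value approaches -J
```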


As an aside, a slightly more elegant formulation of EM (in my view) is to use the objective function \begin{align} F(\mu, \gamma) &= \sum_{n,k} \gamma_{nk} \log\left( \pi_k N(x_n \mid \mu_k, \Sigma_k)/\gamma_{nk} \right) \\ &\propto -\sum_{n,k} \gamma_{nk} ||x_n - \mu_k||^2 - 2\sigma^2 \sum_{n,k} \gamma_{nk} \log \gamma_{nk} + \sigma^2 C, \end{align} again after scaling by $2\sigma^2$. Using this objective function, the EM algorithm amounts to alternating between optimizing $F$ with respect to $\mu$ (M-step) and with respect to $\gamma$ (E-step). Taking the limit $\sigma \to 0$, we see that both the M-step and the E-step converge to the k-means algorithm.
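For completeness, the E-step alluded to here can be made explicit. Holding $\mu$ fixed and maximizing $F$ over $\gamma$ subject to $\sum_k \gamma_{nk} = 1$ (a standard Lagrange-multiplier step I am filling in, not part of the answer above) gives
\begin{align}
\frac{\partial}{\partial \gamma_{nk}}\left[ F + \sum_n \lambda_n \Big(1 - \sum_k \gamma_{nk}\Big) \right]
&= \log \pi_k N(x_n \mid \mu_k, \Sigma_k) - \log \gamma_{nk} - 1 - \lambda_n = 0 \\
\Rightarrow \quad \gamma_{nk} &= \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j N(x_n \mid \mu_j, \Sigma_j)},
\end{align}
i.e. exactly the usual responsibilities. As $\sigma \to 0$ the entropy term $-2\sigma^2 \sum_{n,k} \gamma_{nk} \log \gamma_{nk}$ vanishes, and the maximizer of $-\sum_{n,k} \gamma_{nk} ||x_n - \mu_k||^2$ over the simplex is the hard assignment $r_{nk}$, matching the k-means (E) step.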

See also an alternative view of EM.
