Solved – Deriving K-means algorithm as a limit of Expectation Maximization for Gaussian Mixtures

convergence, expectation-maximization, expected value, maximum likelihood, self-study

Christopher Bishop defines the expected value of the complete-data log likelihood function (i.e. assuming that we are given both the observable data X as well as the latent data Z) as follows:

$$
\mathbb{E}_\textbf{Z}[\ln p(\textbf{X},\textbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi})] = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk})\{\ln \pi_k + \ln \mathcal{N}(\textbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\} \tag 1
$$

where $\gamma(z_{nk})$ is defined as:

$$
\frac{\pi_k \mathcal{N}(\textbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\textbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \tag 2
$$

The idea, as described, is to consider a Gaussian Mixture Model in which the covariance matrices of the mixture components are given by $\epsilon \textbf{I}$, where $\epsilon$ is a variance parameter that is shared by all of the components, such that:

$$
p(\textbf x \mid \boldsymbol \mu_k, \boldsymbol \Sigma_k) = \frac{1}{(2 \pi \epsilon)^\frac{M}{2}} \exp\big\{{-\frac{1}{2 \epsilon} \|\textbf x - \boldsymbol \mu_k\|^2}\big\} \tag 3
$$
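
A quick check that eqn $(3)$ is just the general $M$-dimensional Gaussian with $\boldsymbol \Sigma_k = \epsilon \textbf I$ substituted in, using $|\epsilon \textbf I| = \epsilon^M$ and $(\epsilon \textbf I)^{-1} = \epsilon^{-1} \textbf I$:

$$
\mathcal N(\textbf x \mid \boldsymbol \mu_k, \epsilon \textbf I) = \frac{1}{(2\pi)^{\frac{M}{2}} \, |\epsilon \textbf I|^{\frac{1}{2}}} \exp\big\{ -\tfrac{1}{2} (\textbf x - \boldsymbol \mu_k)^{\mathsf T} (\epsilon \textbf I)^{-1} (\textbf x - \boldsymbol \mu_k) \big\} = \frac{1}{(2 \pi \epsilon)^{\frac{M}{2}}} \exp\big\{ -\frac{1}{2\epsilon} \| \textbf x - \boldsymbol \mu_k \|^2 \big\}
$$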

and so, $\gamma(z_{nk})$ is now defined as:

$$
\frac{\pi_k \exp\{ - \| \textbf x_n - \boldsymbol \mu_k\|^2 / 2 \epsilon\}}{\sum_{j=1}^K \pi_j \exp\{ - \| \textbf x_n - \boldsymbol \mu_j\|^2 / 2 \epsilon\}} \tag 4
$$

The argument now is the following:

If we consider the limit $\epsilon \to 0$, we see that in the denominator the term for which $\| \textbf x_n - \boldsymbol \mu_j\|^2$ is smallest will go to zero most slowly, and hence the responsibilities $\gamma(z_{nk})$ for the data point $\textbf x_n$ all go to zero except for that term $j$, for which the responsibility $\gamma(z_{nj})$ goes to unity. Thus, in this limit, we obtain a hard assignment of data points to clusters, just as in the $K$-means algorithm, so that $\gamma(z_{nk}) \to r_{nk}$,

where $r_{nk}$ is defined as:

\begin{equation*}
r_{nk} = \begin{cases}
1 & \text{if } k = \arg\min_j \|\textbf x_n - \boldsymbol \mu_j\|^2\\
0 & \text{otherwise}
\end{cases} \tag 5
\end{equation*}

My question is: how does the above argument hold? Namely, what does it mean for a term to go to zero **most slowly**? And how does taking the limit $\epsilon \to 0$ in eqn $(4)$ result in a binary responsibility?
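
Numerically the claim does seem to hold. Here is a minimal sketch (a made-up point, means and mixing coefficients; the `eps` values are arbitrary) that simply evaluates eqn $(4)$ for shrinking $\epsilon$:

```python
import numpy as np

# Made-up data point, cluster means and mixing coefficients.
x = np.array([1.0, 0.5])
mus = np.array([[0.0, 0.0], [1.2, 0.4], [3.0, 3.0]])
pis = np.array([0.5, 0.2, 0.3])

# Squared distances ||x_n - mu_k||^2 appearing in eqn (4).
d2 = np.sum((x - mus) ** 2, axis=1)

for eps in [1.0, 0.1, 0.01, 0.001]:
    w = pis * np.exp(-d2 / (2 * eps))   # numerator terms of eqn (4)
    gamma = w / w.sum()                 # responsibilities gamma(z_nk)
    print(eps, np.round(gamma, 4))

# The printed responsibilities approach (0, 1, 0): the component with the
# smallest ||x_n - mu_k||^2 takes all the responsibility, i.e. r_nk.
```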

Best Answer

Let us write $$\|\textbf x_n - \boldsymbol \mu_k\|^2=\delta_k\,.$$ Then $$\frac{\pi_k \exp\{ - \| \textbf x_n - \boldsymbol \mu_k\|^2 / 2 \epsilon\}}{\sum_{j=1}^K \pi_j \exp\{ - \| \textbf x_n - \boldsymbol \mu_j\|^2 / 2 \epsilon\}}=\frac{\pi_k \exp\{ - \delta_k/ 2 \epsilon\}}{\sum_{j=1}^K \pi_j \exp\{ - \delta_j/ 2 \epsilon\}}\,.$$

If we take $$\delta^*=\min_j\delta_j\,,$$ attained at a (assumed unique) index $k^*$, and multiply numerator and denominator by $\exp\{\delta^*/2\epsilon\}$, we have \begin{align*} \frac{\pi_k \exp\{ - \delta_k/ 2 \epsilon\}}{\sum_{j=1}^K \pi_j \exp\{ - \delta_j/ 2 \epsilon\}}&=\frac{\pi_k \exp\{(\delta^*- \delta_k)/ 2 \epsilon\}}{\sum_{j=1}^K \pi_j \exp\{(\delta^* - \delta_j)/ 2 \epsilon\}} \end{align*} where $\delta^*-\delta_k<0$ except for $k=k^*$, where $\delta^*-\delta_{k^*}=0$.

So, for all $k\ne k^*$, $$\lim_{\epsilon\to 0} \frac{\pi_k \exp\{(\delta^*- \delta_k)/ 2 \epsilon\}}{\sum_{j=1}^K \pi_j \exp\{(\delta^* - \delta_j)/ 2 \epsilon\}}=\lim_{\epsilon\to 0} \frac{\pi_k \exp\{(\delta^*- \delta_k)/ 2 \epsilon\}}{\pi_{k^*}+\sum_{j\ne k^*} \pi_j \exp\{(\delta^* - \delta_j)/ 2 \epsilon\}}=0$$ since, for $a>0$, $$\lim_{\epsilon\to 0}\exp\{-a/\epsilon \}=0\,,$$ while $$\lim_{\epsilon\to 0} \frac{\pi_{k^*} \exp\{(\delta^*- \delta_{k^*})/ 2 \epsilon\}}{\sum_{j=1}^K \pi_j \exp\{(\delta^* - \delta_j)/ 2 \epsilon\}}=\lim_{\epsilon\to 0} \frac{\pi_{k^*} \times 1}{\pi_{k^*}+\sum_{j\ne k^*} \pi_j \exp\{(\delta^* - \delta_j)/ 2 \epsilon\}}=1\,.$$
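
The same rearrangement (multiplying numerator and denominator by $\exp\{\delta^*/2\epsilon\}$, i.e. subtracting $\delta^*$ before exponentiating) is also how one would evaluate eqn $(4)$ numerically for tiny $\epsilon$ without underflow, and it makes the limit visible directly. A minimal sketch, with hypothetical squared distances `delta` and mixing weights `pi`:

```python
import numpy as np

def responsibilities(delta, pi, eps):
    """gamma_k from eqn (4), written as in the rearrangement above:
    multiply numerator and denominator by exp(delta_star / (2 * eps))."""
    delta_star = delta.min()                             # delta* = min_j delta_j
    w = pi * np.exp((delta_star - delta) / (2 * eps))    # the k* term is exactly pi_{k*}
    return w / w.sum()

delta = np.array([1.25, 0.05, 10.25])   # hypothetical squared distances delta_k
pi = np.array([0.5, 0.2, 0.3])          # mixing coefficients pi_k

for eps in [1.0, 1e-2, 1e-4, 1e-8]:
    print(eps, np.round(responsibilities(delta, pi, eps), 6))

# Every entry with k != k* tends to 0 and the k* entry tends to 1,
# matching the two limits above: a hard (K-means) assignment.
```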
