Motivation of the Expectation Maximization algorithm

expectation-maximization, mixture-distribution

In the EM algorithm approach we use Jensen's inequality to arrive at $$\log p(x|\theta) \geq \int \log p(z,x|\theta)\, p(z|x,\theta^{(k)})\, dz - \int \log p(z|x,\theta)\, p(z|x,\theta^{(k)})\, dz$$

and define $\theta^{(k+1)}$ by $$\theta^{(k+1)}=\arg \max_{\theta}\int \log p(z,x|\theta) p(z|x,\theta^{(k)}) dz$$

Everything I read about EM just plops this down, but I have always felt uneasy at not having an explanation of why the EM algorithm arises naturally. I understand that the log-likelihood is typically used so that one works with sums instead of products, but the appearance of $\log$ in the definition of $\theta^{(k+1)}$ feels unmotivated to me. Why should one consider $\log$ and not other monotonic functions? For various reasons I suspect that the "meaning" or "motivation" behind expectation maximization has some kind of explanation in terms of information theory and sufficient statistics. If there were such an explanation, it would be much more satisfying than just an abstract algorithm.

Best Answer

Likelihood vs. log-likelihood

As has already been said, the $\log$ is introduced in maximum likelihood simply because it is generally easier to optimize sums than products. The reason we don't consider other monotonic functions is that the logarithm is, up to a constant factor, the only continuous function that turns products into sums.

Another way to motivate the logarithm is the following: Instead of maximizing the probability of the data under our model, we could equivalently try to minimize the Kullback-Leibler divergence between the data distribution, $p_\text{data}(x)$, and the model distribution, $p(x \mid \theta)$,

$$D_\text{KL}[p_\text{data}(x) \mid\mid p(x \mid \theta)] = \int p_\text{data}(x) \log \frac{p_\text{data}(x)}{p(x \mid \theta)} \, dx = const - \int p_\text{data}(x)\log p(x \mid \theta) \, dx.$$

The first term on the right-hand side is constant in the parameters. If we have $N$ samples from the data distribution (our data points), we can approximate the second term with the average log-likelihood of the data,

$$\int p_\text{data}(x)\log p(x \mid \theta) \, dx \approx \frac{1}{N} \sum_n \log p(x_n \mid \theta).$$
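To make this concrete, here is a small numerical sketch (a toy example of my own, with an assumed unit-variance Gaussian model, not something from the original post): the $\theta$ that maximizes the average log-likelihood is, up to the constant entropy term of $p_\text{data}$, the one that minimizes the empirical KL divergence above.

```python
import numpy as np

# Toy check (illustrative, not from the answer): for a unit-variance Gaussian
# model p(x | theta) = N(x; theta, 1), the theta that maximizes the average
# log-likelihood over samples from p_data also minimizes the empirical KL
# divergence, because the entropy of p_data does not depend on theta.

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)  # samples x_n ~ p_data

def avg_log_likelihood(theta, x):
    # (1/N) * sum_n log N(x_n; theta, 1)
    return np.mean(-0.5 * (x - theta) ** 2 - 0.5 * np.log(2 * np.pi))

theta_grid = np.linspace(-1.0, 5.0, 601)
objective = [avg_log_likelihood(t, data) for t in theta_grid]
theta_hat = theta_grid[np.argmax(objective)]

print(theta_hat)      # close to the sample mean, i.e. close to 2.0
print(data.mean())
```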

An alternative view of EM

I am not sure this is going to be the kind of explanation you are looking for, but I found the following view of expectation maximization much more enlightening than its motivation via Jensen's inequality (you can find a detailed description in Neal & Hinton (1998) or in Chris Bishop's PRML book, Chapter 9.3).

It is not difficult to show that

$$\log p(x \mid \theta) = \int q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \, dz + D_\text{KL}[q(z \mid x) \mid\mid p(z \mid x, \theta)]$$

for any $q(z \mid x)$. If we call the first term on the right-hand side $F(q, \theta)$, this implies that

$$F(q, \theta) = \int q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \, dz = \log p(x \mid \theta) - D_\text{KL}[q(z \mid x) \mid\mid p(z \mid x, \theta)].$$
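If it helps, the decomposition itself follows in one line by writing $p(x, z \mid \theta) = p(z \mid x, \theta) \, p(x \mid \theta)$ and using the fact that $q(z \mid x)$ integrates to one:

$$\int q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \, dz + \int q(z \mid x) \log \frac{q(z \mid x)}{p(z \mid x, \theta)} \, dz = \int q(z \mid x) \log \frac{p(x, z \mid \theta)}{p(z \mid x, \theta)} \, dz = \log p(x \mid \theta).$$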

Because the KL divergence is always non-negative, $F(q, \theta)$ is a lower bound on the log-likelihood for every fixed $q$. Now, EM can be viewed as alternately maximizing $F$ with respect to $q$ and $\theta$. In particular, by setting $q(z \mid x) = p(z \mid x, \theta)$ in the E-step, we drive the KL divergence on the right-hand side to zero and thus maximize $F$ over $q$. In the M-step, maximizing $F$ over $\theta$ with $q$ fixed amounts to maximizing $\int q(z \mid x) \log p(x, z \mid \theta) \, dz$, since the entropy term $-\int q(z \mid x) \log q(z \mid x) \, dz$ does not depend on $\theta$; with $q(z \mid x) = p(z \mid x, \theta^{(k)})$ this is exactly the update for $\theta^{(k+1)}$ in your question.
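To illustrate this alternating view, here is a minimal sketch of EM for a toy one-dimensional mixture of two unit-variance Gaussians (all names, initial values and the unit-variance assumption are my own illustrative choices, not code from Neal & Hinton or Bishop): the E-step sets $q$ to the posterior responsibilities, which makes the KL term zero, and the M-step maximizes $F$ over the parameters; the recorded values of $F$ never decrease.

```python
import numpy as np

# Toy EM for a 1-D mixture of two unit-variance Gaussians, viewed as
# coordinate ascent on the lower bound F(q, theta).

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

def log_gauss(x, mu):
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)

pi = np.array([0.5, 0.5])     # mixing weights
mu = np.array([-1.0, 1.0])    # component means (initial guess)
F_history = []

for step in range(50):
    # E-step: set q(z | x) = p(z | x, theta), the posterior responsibilities.
    # This choice makes the KL term zero, so F touches the log-likelihood.
    log_joint = np.log(pi) + log_gauss(x[:, None], mu[None, :])  # log p(x, z)
    log_evidence = np.logaddexp.reduce(log_joint, axis=1)        # log p(x)
    resp = np.exp(log_joint - log_evidence[:, None])             # p(z | x)

    # Lower bound F(q, theta) = sum_n E_q[log p(x_n, z_n) - log q(z_n)];
    # the small constant guards against log(0).
    F_history.append(np.sum(resp * (log_joint - np.log(resp + 1e-300))))

    # M-step: maximize F over theta with q fixed, i.e. maximize
    # sum_n E_q[log p(x_n, z_n | theta)] -- the usual EM update.
    Nk = resp.sum(axis=0)
    pi = Nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / Nk

print(pi, mu)                                 # mu ends up near (-2, 3)
print(np.all(np.diff(F_history) >= -1e-9))    # the lower bound never decreases
```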