Expectation-Maximization – Special Case of the EM Algorithm for Exponential Family Distributions

expectation-maximization, exponential-family

According to Wikipedia, the formal definition of the EM algorithm is:

The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying these two steps:

Expectation step (E step): Define $Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})$ as the expected value of the log likelihood function of $\boldsymbol\theta$, with respect to the current conditional distribution of $\mathbf{Z}$ given $\mathbf{X}$ and the current estimates of the parameters $\boldsymbol\theta^{(t)}$:
$$
Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}}\left[ \log L (\boldsymbol\theta; \mathbf{X},\mathbf{Z}) \right] \,
$$

Maximization step (M step): Find the parameters that maximize this quantity:
$$
\boldsymbol\theta^{(t+1)} = \underset{\boldsymbol\theta}{\operatorname{arg\,max}} \ Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) \,
$$
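
To make the two steps concrete, here is a minimal sketch of EM for a two-component Gaussian mixture with unit variances (a toy setup of my own choosing, not taken from any particular source). In this model the E step reduces to computing posterior responsibilities $p(Z_i = 1 \mid x_i, \theta^{(t)})$, and the M step maximizes $Q(\theta \mid \theta^{(t)})$ in closed form.

```python
import numpy as np

# Toy data: mixture of N(0,1) and N(4,1) with mixing weight 0.6 (illustrative only).
rng = np.random.default_rng(0)
z_true = rng.random(500) < 0.6
x = np.where(z_true, rng.normal(0.0, 1.0, 500), rng.normal(4.0, 1.0, 500))

def em_gaussian_mixture(x, n_iter=50):
    """EM for a 2-component Gaussian mixture with known unit variances.

    Parameters theta = (pi, mu1, mu2); the latent Z_i indicates the component.
    """
    pi, mu1, mu2 = 0.5, -1.0, 1.0          # initial guess theta^(0)
    for _ in range(n_iter):
        # E step: posterior responsibilities p(Z_i = 1 | x_i, theta^(t)),
        # which are all that Q(theta | theta^(t)) depends on in this model.
        w1 = pi * np.exp(-0.5 * (x - mu1) ** 2)
        w2 = (1 - pi) * np.exp(-0.5 * (x - mu2) ** 2)
        r = w1 / (w1 + w2)
        # M step: maximize Q in closed form (weighted means and mixing weight).
        pi = r.mean()
        mu1 = np.sum(r * x) / np.sum(r)
        mu2 = np.sum((1 - r) * x) / np.sum(1 - r)
    return pi, mu1, mu2

print(em_gaussian_mixture(x))  # roughly (0.6, 0.0, 4.0)
```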

In cases where $p(\mathbf{x},\mathbf{z})$ is an exponential family distribution, I have found that authors use
$$
Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) = \log L (\boldsymbol\theta; \mathbf{X},\operatorname{E}_{\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}}\left[\mathbf{Z}\right]) \,
$$

instead of
$$
Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}}\left[ \log L (\boldsymbol\theta; \mathbf{X},\mathbf{Z}) \right] \,
$$

where $\operatorname{E}_{\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}}\left[\mathbf{Z}\right]$ is the estimated data. Are these two equations equivalent?

Best Answer

Suppose we want to estimate the parameter $\theta$ of the distribution $p(z;\theta)$ of a random variable $Z$. If we had samples $z_1, z_2, \dots, z_N$, we could easily do this using maximum likelihood estimation. However, suppose instead we are given samples $x_1,x_2,\dots,x_N$ of another random variable $X$, where $X$ and $Z$ are related via the joint distribution $p(x,z)$. Since $p(x,z) = p(x \mid z) \cdot p(z;\theta)$, the joint distribution depends on $\theta$, so we write it as $p(x,z;\theta)$.

We can then define the complete-data likelihood as
$$
p(x_1,x_2,\dots,x_N,z_1,z_2,\dots,z_N;\theta)
$$
If we further assume that the pairs $(x_i,z_i)$, $i \in \{1,2,\dots,N\}$, are independent and identically distributed, then
$$
p(x_1,x_2,\dots,x_N,z_1,z_2,\dots,z_N;\theta) = \prod_{i=1}^N p(x_i,z_i;\theta)
$$
and
$$
\log p(x_1,x_2,\dots,x_N,z_1,z_2,\dots,z_N;\theta) = \sum_{i=1}^N \log p(x_i,z_i;\theta)
$$

Next, suppose $p(x_i,z_i;\theta)$ belongs to an exponential family,
$$
p(x_i,z_i;\theta) = h(x_i,z_i) \cdot \exp(\eta (\theta) \cdot T(x_i,z_i) + A(\theta))
$$
so that
$$
\log p(x_i,z_i;\theta) = \log h(x_i,z_i) + \eta (\theta) \cdot T(x_i,z_i) + A(\theta)
$$

Then note that
\begin{align}
\mathbb{E}\left[\log p(x_i,z_i;\theta) \mid x_i\right] &= \mathbb{E}\left[\log h(x_i,z_i) \mid x_i\right] + \eta (\theta) \cdot \mathbb{E}\left[T(x_i,z_i) \mid x_i\right] + A(\theta) \\
\log p(x_i,\mathbb{E}\left[z_i \mid x_i\right];\theta) &= \log h(x_i,\mathbb{E}\left[z_i \mid x_i\right]) + \eta (\theta) \cdot T(x_i,\mathbb{E}\left[z_i \mid x_i\right]) + A(\theta)
\end{align}
and so, in general,
$$
\mathbb{E}\left[\log p(x_i,z_i;\theta) \mid x_i\right] \neq \log p(x_i,\mathbb{E}\left[z_i \mid x_i\right];\theta)
$$
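
As a quick numerical sanity check of the final inequality, here is a small sketch using a made-up posterior for $z_i$ given $x_i$ and an invented sufficient statistic $T(x_i,z_i) = z_i^2$ that is nonlinear in $z_i$; all distributions and numbers are purely illustrative and not taken from the question.

```python
# Illustrative (made-up) setup: posterior of Z given x is Bernoulli(0.3),
# and the exponential-family pieces are eta(theta) = theta, T(x, z) = z**2,
# log h = 0, A(theta) = -theta**2 / 2.  Chosen only because T is nonlinear in z.
theta = 1.0
posterior = {0: 0.7, 1: 0.3}          # p(z | x, theta^(t))

def log_p(z, theta):
    """log p(x, z; theta) = log h + eta(theta) * T(x, z) + A(theta)."""
    return theta * z**2 - theta**2 / 2

# Left-hand side: E[ log p(x, Z; theta) | x ]
lhs = sum(p * log_p(z, theta) for z, p in posterior.items())

# Right-hand side: log p(x, E[Z | x]; theta)
ez = sum(p * z for z, p in posterior.items())
rhs = log_p(ez, theta)

print(lhs, rhs)   # -0.2 vs -0.41: the two quantities differ
```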
