Expectation-Maximization – Special Case of the EM Algorithm for Exponential Family Distributions

expectation-maximization, exponential-family

According to Wikipedia, the formal definition of the EM algorithm is:

The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying these two steps:

Expectation step (E step): Define $Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})$ as the expected value of the log likelihood function of $\boldsymbol\theta$, with respect to the current conditional distribution of $\mathbf{Z}$ given $\mathbf{X}$ and the current estimates of the parameters $\boldsymbol\theta^{(t)}$:
$$
Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}}\left[ \log L (\boldsymbol\theta; \mathbf{X},\mathbf{Z}) \right] \,
$$

Maximization step (M step): Find the parameters that maximize this quantity:
$$
\boldsymbol\theta^{(t+1)} = \underset{\boldsymbol\theta}{\operatorname{arg\,max}} \ Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) \,
$$
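
To make the two steps concrete, here is a minimal sketch of EM for a two-component Gaussian mixture with unit variances (a toy setup of my own choosing, not taken from any particular source). In this model the E step reduces to computing posterior responsibilities $p(Z_i = 1 \mid x_i, \theta^{(t)})$, and the M step maximizes $Q(\theta \mid \theta^{(t)})$ in closed form.

```python
import numpy as np

# Toy data: mixture of N(0,1) and N(4,1) with mixing weight 0.6 (illustrative only).
rng = np.random.default_rng(0)
z_true = rng.random(500) < 0.6
x = np.where(z_true, rng.normal(0.0, 1.0, 500), rng.normal(4.0, 1.0, 500))

def em_gaussian_mixture(x, n_iter=50):
    """EM for a 2-component Gaussian mixture with known unit variances.

    Parameters theta = (pi, mu1, mu2); the latent Z_i indicates the component.
    """
    pi, mu1, mu2 = 0.5, -1.0, 1.0          # initial guess theta^(0)
    for _ in range(n_iter):
        # E step: posterior responsibilities p(Z_i = 1 | x_i, theta^(t)),
        # which are all that Q(theta | theta^(t)) depends on in this model.
        w1 = pi * np.exp(-0.5 * (x - mu1) ** 2)
        w2 = (1 - pi) * np.exp(-0.5 * (x - mu2) ** 2)
        r = w1 / (w1 + w2)
        # M step: maximize Q in closed form (weighted means and mixing weight).
        pi = r.mean()
        mu1 = np.sum(r * x) / np.sum(r)
        mu2 = np.sum((1 - r) * x) / np.sum(1 - r)
    return pi, mu1, mu2

print(em_gaussian_mixture(x))  # roughly (0.6, 0.0, 4.0)
```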

In cases where $p(\mathbf{x},\mathbf{z})$ is an exponential family distribution, I have found that authors use
$$
Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) = \log L (\boldsymbol\theta; \mathbf{X},\operatorname{E}_{\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}}\left[\mathbf{Z}\right]) \,
$$

instead of
$$
Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}}\left[ \log L (\boldsymbol\theta; \mathbf{X},\mathbf{Z}) \right] \,
$$

where $\operatorname{E}_{\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}}\left[\mathbf{Z}\right]$ is the estimated data. Are these two equations equivalent?

Best Answer

Suppose we want to estimate the parameter $\theta$ of the distribution $p(z;\theta)$ of a random variable $Z$. If we had samples $z_1, z_2, \dots, z_N$, we could easily do this using maximum likelihood estimation. However, suppose instead we are given samples $x_1,x_2,\dots,x_N$ of another random variable $X$, where $X$ and $Z$ are related via the joint distribution $p(x,z)$. Since $p(x,z) = p(x \mid z) \cdot p(z;\theta)$, the joint distribution depends on $\theta$, so we write it as $p(x,z;\theta)$.

We can then define the complete-data likelihood as
$$
p(x_1,x_2,\dots,x_N,z_1,z_2,\dots,z_N;\theta)
$$
If we further assume that the pairs $(x_i,z_i)$, $i \in \{1,2,\dots,N\}$, are independent and identically distributed, then
$$
p(x_1,x_2,\dots,x_N,z_1,z_2,\dots,z_N;\theta) = \prod_{i=1}^N p(x_i,z_i;\theta)
$$
and
$$
\log p(x_1,x_2,\dots,x_N,z_1,z_2,\dots,z_N;\theta) = \sum_{i=1}^N \log p(x_i,z_i;\theta)
$$

Next, suppose $p(x_i,z_i;\theta)$ belongs to an exponential family,
$$
p(x_i,z_i;\theta) = h(x_i,z_i) \cdot \exp(\eta (\theta) \cdot T(x_i,z_i) + A(\theta))
$$
so that
$$
\log p(x_i,z_i;\theta) = \log h(x_i,z_i) + \eta (\theta) \cdot T(x_i,z_i) + A(\theta)
$$

Then note that
\begin{align}
\mathbb{E}\left[\log p(x_i,z_i;\theta) \mid x_i\right] &= \mathbb{E}\left[\log h(x_i,z_i) \mid x_i\right] + \eta (\theta) \cdot \mathbb{E}\left[T(x_i,z_i) \mid x_i\right] + A(\theta) \\
\log p(x_i,\mathbb{E}\left[z_i \mid x_i\right];\theta) &= \log h(x_i,\mathbb{E}\left[z_i \mid x_i\right]) + \eta (\theta) \cdot T(x_i,\mathbb{E}\left[z_i \mid x_i\right]) + A(\theta)
\end{align}
and so, in general,
$$
\mathbb{E}\left[\log p(x_i,z_i;\theta) \mid x_i\right] \neq \log p(x_i,\mathbb{E}\left[z_i \mid x_i\right];\theta)
$$
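
As a quick numerical sanity check of the final inequality, here is a small sketch using a made-up posterior for $z_i$ given $x_i$ and an invented sufficient statistic $T(x_i,z_i) = z_i^2$ that is nonlinear in $z_i$; all distributions and numbers are purely illustrative and not taken from the question.

```python
# Illustrative (made-up) setup: posterior of Z given x is Bernoulli(0.3),
# and the exponential-family pieces are eta(theta) = theta, T(x, z) = z**2,
# log h = 0, A(theta) = -theta**2 / 2.  Chosen only because T is nonlinear in z.
theta = 1.0
posterior = {0: 0.7, 1: 0.3}          # p(z | x, theta^(t))

def log_p(z, theta):
    """log p(x, z; theta) = log h + eta(theta) * T(x, z) + A(theta)."""
    return theta * z**2 - theta**2 / 2

# Left-hand side: E[ log p(x, Z; theta) | x ]
lhs = sum(p * log_p(z, theta) for z, p in posterior.items())

# Right-hand side: log p(x, E[Z | x]; theta)
ez = sum(p * z for z, p in posterior.items())
rhs = log_p(ez, theta)

print(lhs, rhs)   # -0.2 vs -0.41: the two quantities differ
```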
