[Math] Monotonicity of the hard EM algorithm.

learning-theory, linear-algebra, probability-distributions, st.statistics

Consider the problem where we want to find a maximum likelihood estimate of $\theta$ given the observed data $Y$, where $$P_\theta(Y) = \sum_x P_\theta(Y,x)$$ and $x$ is a latent variable.

I know that the soft EM algorithm guarantees that every step will not decrease the likelihood of the observed data. I was pretty sure this is also true for the hard EM algorithm, but I've stumbled on a test case where a step of hard EM seems to decrease the likelihood function.

EDIT: Yeah, hard EM definitely does NOT converge to a maximum of the observed likelihood… Toy example:
$x \sim \{0,1\}$ is a latent random variable, uniform (a coin flip)
$y \sim N(x, s)$ is the observed random variable ($s$ is the standard deviation)
$s > 0$ is a parameter

There's only one observation, $y = -0.75$.
Let's run hard EM, shall we?
No matter how we initialize $s$, the joint likelihood is maximized by $x=0$.
Assuming $x=0$, the maximum likelihood is attained at $s = 0.75$. But what is the real likelihood of this model? It's $0.5\, p(-0.75 \mid x=0, s=0.75) + 0.5\, p(-0.75 \mid x=1, s=0.75)$, which is about $0.179$.
Yet the observed likelihood is maximized at $s \approx 1.11$, where it is about $0.195$, i.e. $8.94\%$ better. It's not a local-maximum problem either: we could have initialized $s$ at $1.11$.
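For what it's worth, here is a small numeric check of this toy example (a Python sketch, assuming NumPy/SciPy; the helper names are my own):

```python
import numpy as np
from scipy.stats import norm

y = -0.75   # the single observation

def marginal_likelihood(s):
    """Observed likelihood: 0.5 * N(y; 0, s) + 0.5 * N(y; 1, s)."""
    return 0.5 * norm.pdf(y, 0.0, s) + 0.5 * norm.pdf(y, 1.0, s)

def hard_em_step(s):
    """Pick the single most likely x, then the MLE of s for that x."""
    x = 0 if norm.pdf(y, 0.0, s) >= norm.pdf(y, 1.0, s) else 1
    return abs(y - x)   # MLE of the standard deviation from one observation

s = 2.0                 # any initialization gives the same fixed point here
for _ in range(20):
    s = hard_em_step(s)
print(s, marginal_likelihood(s))        # s = 0.75, likelihood ~ 0.179

grid = np.linspace(0.05, 3.0, 10000)    # brute-force the observed likelihood
best = grid[np.argmax(marginal_likelihood(grid))]
print(best, marginal_likelihood(best))  # s ~ 1.11, likelihood ~ 0.195
```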
The reason hard EM works in practice is presumably that when you run the Viterbi algorithm on a long stationary chain, the single best state path effectively samples the whole state distribution.

EDIT: The more I look at it, the more it seems that the "hard EM" does not guarantee convergence to a local maximum of the likelihood! Wikipedia somehow seemed to imply it did (I just stuck a big citation-needed on that), but looking at http://www.csb.ethz.ch/education/spring09/bioinfomaticsII/pdf-lectures/HMM2-slides
I read

"Convergence is guaranteed; note that the algorithm (unlike Baum-Welch) does not maximize the likelihood P(Xm | Π) but contributions to P(Xm | Π, Q(Xm)) as a function of the parameters Π."
Yes, that makes much more sense. The hard EM maximizes the likelihood evaluated at the single best latent assignment, not the observed likelihood.
Can anyone confirm that with more sources?

By hard EM I mean

$$x_t = \arg \max_x\left(P_{\theta_{t-1}}(Y,x)\right)$$
$$\theta_t = \arg \max_\theta \left(P_{\theta}(Y,x_t)\right)$$
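Schematically, that's the following loop (a sketch; the two argmax callbacks are problem-specific and hypothetical):

```python
def hard_em(Y, theta0, argmax_x, argmax_theta, n_iter=100):
    """Alternate a hard assignment of the latent variable with a
    maximization over the parameters, as in the two equations above."""
    theta = theta0
    for _ in range(n_iter):
        x = argmax_x(theta, Y)       # x_t = argmax_x P_{theta_{t-1}}(Y, x)
        theta = argmax_theta(x, Y)   # theta_t = argmax_theta P_theta(Y, x_t)
    return theta, x
```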

I've verified manually that $x_t$ and $\theta_t$ are computed correctly, and I've verified the likelihood function…

If you'd like to try, the problem is the following. I'll write the parameter as $\theta = \Sigma$, the observations as $Y = (A,B)$, and the latent variable as $x = X$.

$$P_\Sigma( A,B, X) = \mathbf{1}_{AX=B} \frac{e^{-\frac{1}{2}X^tS^{-1}X}}{(2\pi)^{NM/2}|S|^{1/2}}$$
where $S$ is the block-diagonal matrix below, with $N$ blocks each of size $M \times M$; $\Sigma$ is always symmetric positive definite.
$$
S = \left(\begin{array}{cccc}
\Delta_1 \Sigma & 0 & \cdots & 0\\\\
0 & \Delta_2 \Sigma & \cdots & 0\\\\
\vdots & \vdots & \ddots & \vdots \\\\
0 & 0 & \cdots & \Delta_N \Sigma
\end{array}\right)
$$
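In code, $S$ can be assembled like this (a sketch assuming NumPy/SciPy; `build_S` is my own helper name):

```python
import numpy as np
from scipy.linalg import block_diag

def build_S(Sigma, Delta):
    """S = blockdiag(Delta_1 * Sigma, ..., Delta_N * Sigma), an (N*M) x (N*M) matrix."""
    return block_diag(*[d * Sigma for d in Delta])
```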

Here's an example (copy-paste from HTML won't work well, so if you're a kind soul who wants to give it a try, you can also get the data here: http://paste.ubuntu.com/639213/ ).

Set $M=3$, $\Sigma_0 = I$, $N=10$, $\Delta =$
$[1.19571,0.98553,1.48534,1.72635,3.4422,0.96689,0.01366,0.53092,1.1134,0.9718]$

$$A^t = \left(\begin{array}{cccccccccc}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\\\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\\\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\\\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\\\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\\\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1\\\\
\end{array}\right)
$$

$B^t = (-2.46269,-1.06192,0.16578,-3.390909,1.619429,3.33237,-1.00818, 0.90752)$
I find that $P$ is maximized when
$$X_1 = \left(\begin{array}{c}
-0.546023113539513\\\\
0\\\\
0\\\\
-0.450044162425061\\\\
0\\\\
0.645924400969046\\\\
-0.678282922487939\\\\
-0.616831349169687\\\\
0.973504460926985\\\\
-0.788339891896596\\\\
-0.716918568383787\\\\
1.11304689180543\\\\
-0.829046405178143\\\\
-1.42947697785166\\\\
2.21932425211845\\\\
-0.232873327013359\\\\
-0.401529911986066\\\\
-0.371369961109884\\\\
0.165779960962481\\\\
-0.00567269366556661\\\\
-0.0052466806358363\\\\
0\\\\
-0.220480684975234\\\\
-0.203920096563427\\\\
0\\\\
0\\\\
-0.427643477547802\\\\
0\\\\
0\\\\
0.907519813639624
\end{array}\right)$$

And given $X_1$, $P$ is maximized for

$$\Sigma_1 = \left(\begin{array}{ccc}
0.339228113 & 0.098120669 & -0.17565382\\\\
0.098120669 & 0.140817533 & -0.154705723\\\\
-0.17565382 & -0.154705723 & 0.444462
\end{array}\right)$$

However, I find that $P_{\Sigma_0}(A,B) = e^{-10.8816}$ and $P_{\Sigma_1}(A,B) = e^{-12.843}$, so this single hard-EM step has decreased the observed likelihood.
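For anyone who wants to reproduce this, here is a NumPy/SciPy sketch of one hard-EM step on the data above (my own transcription of $A$, $B$, $\Delta$; I read $P_\Sigma(A,B)$ as the joint density integrated over the affine subspace $\{AX=B\}$, which up to the constant factor $\sqrt{\det(AA^t)}$ is just the Gaussian density of $B \sim N(0, ASA^t)$):

```python
import numpy as np
from scipy.linalg import block_diag

M, N = 3, 10
Delta = np.array([1.19571, 0.98553, 1.48534, 1.72635, 3.4422,
                  0.96689, 0.01366, 0.53092, 1.1134, 0.9718])
B = np.array([-2.46269, -1.06192, 0.16578, -3.390909,
              1.619429, 3.33237, -1.00818, 0.90752])

# Rows of A^t each contain at most one 1; `cols` gives its column (-1 = all-zero row).
cols = [0, -1, -1, 0, -1, 4, 0, 3, 4, 0, 3, 5, 1, 3, 5,
        1, 3, 6, 2, 3, 6, -1, 3, 6, -1, -1, 6, -1, -1, 7]
A = np.zeros((len(B), N * M))
for row, c in enumerate(cols):
    if c >= 0:
        A[c, row] = 1.0

def build_S(Sigma):
    """S = blockdiag(Delta_1 * Sigma, ..., Delta_N * Sigma)."""
    return block_diag(*[d * Sigma for d in Delta])

def e_step(Sigma):
    """argmax_X of 1_{AX=B} exp(-X^t S^{-1} X / 2): the point of {AX=B} minimizing X^t S^{-1} X."""
    S = build_S(Sigma)
    return S @ A.T @ np.linalg.solve(A @ S @ A.T, B)

def m_step(X):
    """MLE of Sigma given X: average of x_i x_i^t / Delta_i over the N blocks of X."""
    blocks = X.reshape(N, M)
    return sum(np.outer(x, x) / d for x, d in zip(blocks, Delta)) / N

def log_marginal(Sigma):
    """log P_Sigma(A,B): Gaussian log-density of B ~ N(0, A S A^t),
    plus the constant (1/2) log det(A A^t) from integrating over {AX=B}."""
    C = A @ build_S(Sigma) @ A.T
    quad = B @ np.linalg.solve(C, B)
    return (-0.5 * (quad + np.linalg.slogdet(C)[1] + len(B) * np.log(2 * np.pi))
            + 0.5 * np.linalg.slogdet(A @ A.T)[1])

Sigma0 = np.eye(M)
X1 = e_step(Sigma0)       # should match X_1 above
Sigma1 = m_step(X1)       # should match Sigma_1 above
print(log_marginal(Sigma0), log_marginal(Sigma1))   # ~ -10.88 vs ~ -12.84
```

If I haven't mistranscribed anything, the two printed values should roughly match $-10.8816$ and $-12.843$ above.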

Best Answer

Hard EM is not the version proposed by the fathers of the EM algorithm, Dempster, Laird and Rubin (1977). EM maximises at each step $$ \mathbb{E}_{\theta^t}\left[\log f(y,X\mid\theta) \,\middle|\, y\right] $$ in $\theta$ and takes $\theta^{t+1}$ to be the argument of this maximum. By a simple application of Jensen's inequality, the observed likelihood never decreases from one step of the algorithm to the next.
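To make that concrete on the coin-flip toy example from the question, here is a minimal sketch of the soft update (assuming SciPy; equal prior weight on $x \in \{0,1\}$ and $s$ the standard deviation):

```python
import numpy as np
from scipy.stats import norm

y = -0.75   # the single observation from the toy example

def marginal_likelihood(s):
    """Observed likelihood: 0.5 * N(y; 0, s) + 0.5 * N(y; 1, s)."""
    return 0.5 * norm.pdf(y, 0.0, s) + 0.5 * norm.pdf(y, 1.0, s)

def soft_em_step(s):
    """E-step: posterior weights of x = 0 and x = 1 given y.
    M-step: maximize E[log p(y, x | s) | y], i.e. s^2 = sum_k r_k (y - k)^2."""
    r = np.array([norm.pdf(y, 0.0, s), norm.pdf(y, 1.0, s)])
    r /= r.sum()
    return np.sqrt(r[0] * (y - 0.0) ** 2 + r[1] * (y - 1.0) ** 2)

s = 0.3
for _ in range(100):
    s_new = soft_em_step(s)
    assert marginal_likelihood(s_new) >= marginal_likelihood(s) - 1e-12  # monotone
    s = s_new
print(s, marginal_likelihood(s))   # ends up near s ~ 1.11, likelihood ~ 0.195
```

Unlike the hard assignment, the soft update keeps both branches weighted by their posterior probability, which is why it climbs the observed likelihood at every iteration.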
