Solved – Convergence of the EM algorithm with a bivariate mixture distribution

expectation-maximization, maximum-likelihood, mixture-distribution

I have a mixture model for which I want to find the maximum likelihood estimator, given a set of data $x$ and a set of partially observed data $z$. I have implemented both the E-step (calculating the expectation of $z$ given $x$ and the current parameters $\theta^{(k)}$) and the M-step, which minimizes the negative log-likelihood given the expected $z$.

As I understand it, the likelihood increases at every iteration, which means that the negative log-likelihood must decrease at every iteration, correct? However, as I iterate, the algorithm does not produce monotonically decreasing values of the negative log-likelihood; instead it both decreases and increases. For instance, these were the values of the negative log-likelihood until convergence:

[Figure 1: negative log-likelihood per EM iteration]

Is there something here that I have misunderstood?

Also, for simulated data, when I perform maximum likelihood estimation using the true latent (unobserved) variables, I get a close to perfect fit, which suggests there are no programming errors. The EM algorithm, however, often converges to clearly suboptimal solutions, particularly for a specific subset of the parameters (i.e. the proportions of the classifying variables). It is well known that the algorithm may converge to local minima or stationary points; is there a conventional search heuristic, or something similar, to increase the chance of finding the global minimum (or maximum)? For this particular problem I believe there are many misclassifications because, of the bivariate mixture, one of the two distributions takes values with probability one (it is a mixture of lifetimes where the true lifetime is given by $T=z T_0 + (1-z)\infty$, with $z$ indicating membership in one of the two distributions). The indicator $z$ is of course censored in the data set.
[Figure 2: negative log-likelihood per iteration when starting from the theoretical solution]

I added a second figure showing what happens when I start from the theoretical solution (which should be close to the optimum). However, as can be seen, the likelihood and the parameters diverge from this solution to one that is clearly inferior.

edit:
The full data are of the form $\mathbf{x_i}=(t_i,\delta_i,L_i,\tau_i,z_i)$, where $t_i$ is an observed time for subject $i$, $\delta_i$ indicates whether the time corresponds to an actual event or to right censoring (1 denotes an event and 0 denotes right censoring), $L_i$ is the truncation time of the observation (possibly 0) with truncation indicator $\tau_i$, and finally $z_i$ is the indicator of the population to which the observation belongs (since the mixture is bivariate we only need to consider 0's and 1's).
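(To make the setup concrete, here is a minimal sketch of how data of this form could be simulated. The exponential lifetime for the $z=1$ component, the exponential right-censoring, and the absence of left truncation are illustrative assumptions only, not my actual model.)

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=1000, p=0.6, lam=0.1, cens_rate=0.05):
    """Simulate (t, delta, L, tau, z) for a cure-type mixture:
    z = 1 -> exponential(lam) lifetime, z = 0 -> the event never occurs."""
    z = rng.binomial(1, p, n)                  # latent class indicator
    t_event = rng.exponential(1.0 / lam, n)    # lifetime if z = 1
    c = rng.exponential(1.0 / cens_rate, n)    # right-censoring time
    # z = 0 subjects have an "infinite" lifetime, so they are always censored
    t = np.where(z == 1, np.minimum(t_event, c), c)
    delta = ((z == 1) & (t_event <= c)).astype(int)
    L = np.zeros(n)                            # no left truncation in this sketch
    tau = np.zeros(n, dtype=int)
    return t, delta, L, tau, z
```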

For $z=1$ we have the density function $f_z(t)=f(t|z=1)$, with associated tail (survival) function $S_z(t)=S(t|z=1)$. For $z=0$ the event of interest will not occur; although there is no $t$ associated with this distribution, we define it to be $\infty$, so that $f(t|z=0)=0$ and $S(t|z=0)=1$. This yields the following full mixture distribution:

$f(t) = \sum_{i=0}^{1}p_if(t|z=i) = pf(t|z=1)$ and
$S(t) = 1 - p + pS_z(t)$
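For concreteness, here is how the mixture density and tail function might look in code if the $z=1$ component were, say, exponential with rate $\lambda$; the parametric choice is only an illustration.

```python
import numpy as np

def f_z(t, lam):
    """Density of T given z = 1 (assumed exponential(lam) for illustration)."""
    return lam * np.exp(-lam * t)

def S_z(t, lam):
    """Tail function of T given z = 1 under the same assumption."""
    return np.exp(-lam * t)

def f_mix(t, p, lam):
    """f(t) = p * f(t | z = 1); the z = 0 component puts no mass on finite t."""
    return p * f_z(t, lam)

def S_mix(t, p, lam):
    """S(t) = 1 - p + p * S_z(t)."""
    return 1.0 - p + p * S_z(t, lam)
```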

We proceed to define the general form of the likelihood:

$ L(\theta;\mathbf{x_i}) = \Pi_i \frac{f(t_i;\theta)^{\delta_i}S(t_i;\theta)^{1-\delta_i}}{S(L_i;\theta)^{\tau_i}}$
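In code (keeping the illustrative exponential component and the mixture $f$, $S$ from above), the corresponding observed-data log-likelihood could be evaluated as:

```python
import numpy as np

def observed_loglik(theta, p, t, delta, L, tau):
    """ln L(theta; x) = sum_i [ delta_i ln f(t_i) + (1 - delta_i) ln S(t_i)
                                - tau_i ln S(L_i) ],
    with f(t) = p f_z(t) and S(t) = 1 - p + p S_z(t),
    assuming an exponential(theta) component for z = 1."""
    f_t = p * theta * np.exp(-theta * t)
    S_t = 1.0 - p + p * np.exp(-theta * t)
    S_L = 1.0 - p + p * np.exp(-theta * L)
    return (delta * np.log(f_t) + (1 - delta) * np.log(S_t)
            - tau * np.log(S_L)).sum()
```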

Now, $z$ is only partially observed: it is known when $\delta=1$ and unknown otherwise. The full likelihood becomes

$ L(\theta,p;\mathbf{x_i}) = \Pi_i \frac{\big((p f_z(t_i;\theta))^{z_i}\big)^{\delta_i}\big((1-p)^{(1-z_i)}(p S_z(t_i;\theta))^{z_i}\big)^{1-\delta_i}}{\big((1-p)^{(1-z_i)}(p S_z(L_i;\theta))^{z_i}\big)^{\tau_i}}$

where $p$ is the weight of the corresponding distribution (possibly associated with some covariates and their respective coefficients through some link function). In most of the literature this is simplified to the following log-likelihood:

$\sum_i \Big( z_i \ln(p) + (1-z_i) \ln(1-p) - \tau_i\big(z_i \ln(p) + (1-z_i)\ln(1-p)\big) + \delta_i z_i \ln f_z(t_i;\theta) + (1-\delta_i) z_i \ln S_z(t_i;\theta) - \tau_i z_i \ln S_z(L_i;\theta)\Big)$
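As a sketch, this complete-data log-likelihood could be coded as follows (again with an exponential stand-in for $f_z$ and $S_z$; during EM the expected $z_i$ from the E-step is plugged in for $z$):

```python
import numpy as np

def complete_loglik(theta, p, t, delta, L, tau, z):
    """Complete-data log-likelihood from the expression above.
    theta is the rate of an assumed exponential component for z = 1;
    z may be the expected latent indicator from the E-step."""
    lam = theta
    log_fz = np.log(lam) - lam * t      # ln f_z(t)
    log_Sz_t = -lam * t                 # ln S_z(t)
    log_Sz_L = -lam * L                 # ln S_z(L)
    ll = (z * np.log(p) + (1 - z) * np.log(1 - p)
          - tau * (z * np.log(p) + (1 - z) * np.log(1 - p))
          + delta * z * log_fz
          + (1 - delta) * z * log_Sz_t
          - tau * z * log_Sz_L)
    return ll.sum()
```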

For the M-step, this function is maximized, although not in its entirety by a single maximization routine. Instead we note that it separates into two parts, $l(\theta,p; \cdot) = l_1(\theta,\cdot) + l_2(p,\cdot)$, which can be maximized independently, as sketched below.
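Concretely, for the log-likelihood above the $p$-part reduces to $\sum_i (1-\tau_i)\big(z_i\ln p + (1-z_i)\ln(1-p)\big)$, which has the closed-form maximizer $\hat p = \sum_i(1-\tau_i)\hat z_i \big/ \sum_i(1-\tau_i)$, while the $\theta$-part can be handed to a numerical optimizer. A sketch of such an M-step (with the same illustrative exponential component; not my actual parametrization):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def m_step(t, delta, L, tau, z_hat):
    """Maximize the two separable pieces of the complete-data log-likelihood."""
    # l_2(p) = sum_i (1 - tau_i) [ z_i ln p + (1 - z_i) ln(1 - p) ]
    w = 1 - tau
    p_new = (w * z_hat).sum() / w.sum()

    # l_1(theta), here with theta the rate of an assumed exponential component
    def neg_l1(lam):
        log_fz = np.log(lam) - lam * t
        log_Sz_t = -lam * t
        log_Sz_L = -lam * L
        return -(delta * z_hat * log_fz
                 + (1 - delta) * z_hat * log_Sz_t
                 - tau * z_hat * log_Sz_L).sum()

    res = minimize_scalar(neg_l1, bounds=(1e-8, 1e3), method="bounded")
    return res.x, p_new
```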

For the (k+1):th E-step, we must find the expected value of the (partially) unobserved latent variables $z_i$. We use the fact that for $\delta=1$ we have $z=1$.

$E(z_i|\mathbf{x_i},\theta^{(k)},p^{(k)}) = \delta_i + (1-\delta_i) P(z_i=1;\theta^{(k)},p^{(k)}|\mathbf{x_i})$

Here, by Bayes' theorem, we have $P(z_i=1;\theta^{(k)},p^{(k)}|\mathbf{x_i}) =\frac{P(\mathbf{x_i};\theta^{(k)},p^{(k)}|z_i=1)P(z_i=1;\theta^{(k)},p^{(k)})}{P(\mathbf{x_i};\theta^{(k)},p^{(k)})},$

which gives us $P(z_i=1;\theta^{(k)},p^{(k)}|\mathbf{x_i})=\frac{pS_z(t_i;\theta^{(k)})}{1 - p + pS_z(t_i;\theta^{(k)})}.$

(Note that here $\delta_i=0$, so no event is observed; thus the probability of the data $\mathbf{x_i}$ is given by the tail distribution function.)
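In code, the E-step is then a one-liner once $S_z$ is available (illustrative exponential component again):

```python
import numpy as np

def e_step(t, delta, theta, p):
    """E[z_i | x_i, theta, p]: 1 if an event was observed, otherwise the
    posterior probability p S_z(t) / (1 - p + p S_z(t))."""
    Sz = np.exp(-theta * t)     # S_z(t) for an assumed exponential component
    post = p * Sz / (1.0 - p + p * Sz)
    return delta + (1 - delta) * post
```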

Best Answer

The objective of EM is to maximize the observed data log-likelihood,

$$l(\theta) = \sum_i \ln \left[ \sum_{z} p(x_i, z| \theta) \right] $$

Unfortunately, this tends to be difficult to optimize with respect to $\theta$. Instead, EM repeatedly forms and maximizes the auxiliary function

$$Q(\theta , \theta^t) = \mathbb{E}_{z|\theta^t} \left (\sum_i \ln p(x_i, z_i| \theta) \right)$$

If $\theta^{t+1}$ maximizes $Q(\theta, \theta^t)$, EM guarantees that

$$l(\theta^{t+1}) \geq Q(\theta^{t+1}, \theta^t) \geq Q(\theta^t, \theta^t) = l(\theta^t)$$

If you'd like to know exactly why this is the case, Section 11.4.7 of Murphy's Machine Learning: A Probabilistic Perspective gives a good explanation. If your implementation doesn't satisfy these inequalities, you've made a mistake somewhere. Saying things like

I have a close to perfect fit, indicating there are no programming errors

is dangerous. With a lot of optimization and learning algorithms, it's very easy to make mistakes yet still get correct-looking answers most of the time. An intuition I'm fond of is that these algorithms are intended to deal with messy data, so it's not surprising that they also deal well with bugs!
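One practical way to turn the guarantee above into a debugging aid is to recompute the observed-data log-likelihood after every M-step and assert that it never decreases (equivalently, that the negative log-likelihood never increases). Here is a minimal sketch, written so that it takes your own E-step, M-step, and observed-data log-likelihood routines as arguments (the function names and signatures below are placeholders matching the sketches in the question, not a prescribed interface):

```python
def run_em_with_check(t, delta, L, tau, theta, p,
                      e_step, m_step, observed_loglik,
                      n_iter=200, tol=1e-10):
    """EM loop that recomputes the observed-data log-likelihood after every
    M-step and asserts that it never decreases (up to numerical noise)."""
    ll_old = observed_loglik(theta, p, t, delta, L, tau)
    history = [ll_old]
    for _ in range(n_iter):
        z_hat = e_step(t, delta, theta, p)           # E-step
        theta, p = m_step(t, delta, L, tau, z_hat)   # M-step
        ll_new = observed_loglik(theta, p, t, delta, L, tau)
        assert ll_new >= ll_old - 1e-8, "log-likelihood decreased: bug in E- or M-step"
        history.append(ll_new)
        if ll_new - ll_old < tol:
            break
        ll_old = ll_new
    return theta, p, history
```

The iteration at which the assertion first fires usually points you to the step that violates the EM inequality.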


On to the other half of your question,

is there a conventional search heuristic or likewise to increase the likelihood of finding the global minimum (or maximum)

Random restarts is the easiest approach; next easiest is probably simulated annealing over the initial parameters. I've also heard of a variant of EM called deterministic annealing, but I haven't used it personally so can't tell you much about it.
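For example, a minimal random-restart wrapper might look like the sketch below. It assumes a `run_em` function that returns the final parameters together with the observed-data log-likelihood (such as a wrapper around the check above), and the starting ranges are placeholders you would adapt to your model. Comparing runs by the observed-data log-likelihood (not the complete-data objective) is what makes the comparison across starts meaningful.

```python
import numpy as np

rng = np.random.default_rng(42)

def em_with_restarts(run_em, t, delta, L, tau, n_starts=20):
    """Run EM from several random starting points and keep the solution with
    the highest observed-data log-likelihood. run_em is any function with the
    signature run_em(t, delta, L, tau, theta0, p0) -> (theta, p, loglik)."""
    best = None
    for _ in range(n_starts):
        theta0 = rng.uniform(0.01, 1.0)    # illustrative range for the rate parameter
        p0 = rng.uniform(0.05, 0.95)       # illustrative range for the mixture weight
        theta, p, ll = run_em(t, delta, L, tau, theta0, p0)
        if best is None or ll > best[2]:
            best = (theta, p, ll)
    return best
```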
