Solved – Convergence of the EM algorithm with a bivariate mixture distribution

expectation-maximization, maximum-likelihood, mixture-distribution

I have a mixture model for which I want to find the maximum likelihood estimator, given a set of data $x$ and a set of partially observed data $z$. I have implemented both the E-step (calculating the expectation of $z$ given $x$ and the current parameters $\theta^{(k)}$) and the M-step, which minimizes the negative log-likelihood given the expected $z$.

As I understand it, the likelihood increases at every iteration, which means that the negative log-likelihood must decrease at every iteration, correct? However, as I iterate, the algorithm does not produce monotonically decreasing values of the negative log-likelihood; instead it both decreases and increases. For instance, these were the values of the negative log-likelihood until convergence:

[Figure 1: negative log-likelihood per EM iteration]

Is there something here that I have misunderstood?

Also, for simulated data, when I perform maximum likelihood estimation using the true latent (unobserved) variables, I get a close to perfect fit, which suggests there are no programming errors. The EM algorithm, however, often converges to clearly suboptimal solutions, particularly for a specific subset of the parameters (i.e. the proportions of the classifying variables). It is well known that the algorithm may converge to local minima or stationary points; is there a conventional search heuristic, or something similar, to increase the chance of finding the global minimum (or maximum)? For this particular problem I believe there are many misclassifications because, of the bivariate mixture, one of the two distributions takes values with probability one (it is a mixture of lifetimes where the true lifetime is given by $T=z T_0 + (1-z)\infty$, with $z$ indicating membership in one of the two distributions). The indicator $z$ is of course censored in the data set.
[Figure 2: negative log-likelihood per iteration when starting from the theoretical solution]

I added a second figure showing what happens when I start from the theoretical solution (which should be close to the optimum). However, as can be seen, the likelihood and the parameters diverge from this solution to one that is clearly inferior.

edit:
The full data are of the form $\mathbf{x_i}=(t_i,\delta_i,L_i,\tau_i,z_i)$, where $t_i$ is an observed time for subject $i$, $\delta_i$ indicates whether the time corresponds to an actual event or to right censoring (1 denotes an event and 0 denotes right censoring), $L_i$ is the truncation time of the observation (possibly 0) with truncation indicator $\tau_i$, and finally $z_i$ is the indicator of the population to which the observation belongs (since the mixture is bivariate we only need to consider 0's and 1's).
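(To make the setup concrete, here is a minimal sketch of how data of this form could be simulated. The exponential lifetime for the $z=1$ component, the exponential right-censoring, and the absence of left truncation are illustrative assumptions only, not my actual model.)

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=1000, p=0.6, lam=0.1, cens_rate=0.05):
    """Simulate (t, delta, L, tau, z) for a cure-type mixture:
    z = 1 -> exponential(lam) lifetime, z = 0 -> the event never occurs."""
    z = rng.binomial(1, p, n)                  # latent class indicator
    t_event = rng.exponential(1.0 / lam, n)    # lifetime if z = 1
    c = rng.exponential(1.0 / cens_rate, n)    # right-censoring time
    # z = 0 subjects have an "infinite" lifetime, so they are always censored
    t = np.where(z == 1, np.minimum(t_event, c), c)
    delta = ((z == 1) & (t_event <= c)).astype(int)
    L = np.zeros(n)                            # no left truncation in this sketch
    tau = np.zeros(n, dtype=int)
    return t, delta, L, tau, z
```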

For $z=1$ we have the density function $f_z(t)=f(t|z=1)$, with associated tail (survival) function $S_z(t)=S(t|z=1)$. For $z=0$ the event of interest will not occur; although there is no $t$ associated with this distribution, we define it to be $\infty$, so that $f(t|z=0)=0$ and $S(t|z=0)=1$. This yields the following full mixture distribution:

$f(t) = \sum_{i=0}^{1}p_if(t|z=i) = pf(t|z=1)$ and
$S(t) = 1 - p + pS_z(t)$
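For concreteness, here is how the mixture density and tail function might look in code if the $z=1$ component were, say, exponential with rate $\lambda$; the parametric choice is only an illustration.

```python
import numpy as np

def f_z(t, lam):
    """Density of T given z = 1 (assumed exponential(lam) for illustration)."""
    return lam * np.exp(-lam * t)

def S_z(t, lam):
    """Tail function of T given z = 1 under the same assumption."""
    return np.exp(-lam * t)

def f_mix(t, p, lam):
    """f(t) = p * f(t | z = 1); the z = 0 component puts no mass on finite t."""
    return p * f_z(t, lam)

def S_mix(t, p, lam):
    """S(t) = 1 - p + p * S_z(t)."""
    return 1.0 - p + p * S_z(t, lam)
```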

We proceed to define the general form of the likelihood:

$ L(\theta;\mathbf{x_i}) = \Pi_i \frac{f(t_i;\theta)^{\delta_i}S(t_i;\theta)^{1-\delta_i}}{S(L_i;\theta)^{\tau_i}}$
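In code (keeping the illustrative exponential component and the mixture $f$, $S$ from above), the corresponding observed-data log-likelihood could be evaluated as:

```python
import numpy as np

def observed_loglik(theta, p, t, delta, L, tau):
    """ln L(theta; x) = sum_i [ delta_i ln f(t_i) + (1 - delta_i) ln S(t_i)
                                - tau_i ln S(L_i) ],
    with f(t) = p f_z(t) and S(t) = 1 - p + p S_z(t),
    assuming an exponential(theta) component for z = 1."""
    f_t = p * theta * np.exp(-theta * t)
    S_t = 1.0 - p + p * np.exp(-theta * t)
    S_L = 1.0 - p + p * np.exp(-theta * L)
    return (delta * np.log(f_t) + (1 - delta) * np.log(S_t)
            - tau * np.log(S_L)).sum()
```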

Now, $z$ is only partially observed: it is known when $\delta=1$ and unknown otherwise. The full likelihood becomes

$ L(\theta,p;\mathbf{x_i}) = \Pi_i \frac{\big((p f_z(t_i;\theta))^{z_i}\big)^{\delta_i}\big((1-p)^{(1-z_i)}(p S_z(t_i;\theta))^{z_i}\big)^{1-\delta_i}}{\big((1-p)^{(1-z_i)}(p S_z(L_i;\theta))^{z_i}\big)^{\tau_i}}$

where $p$ is the weight of the corresponding distribution (possibly associated with some covariates and their respective coefficients through some link function). In most of the literature this is simplified to the following log-likelihood:

$\sum_i \Big( z_i \ln(p) + (1-z_i) \ln(1-p) - \tau_i\big(z_i \ln(p) + (1-z_i)\ln(1-p)\big) + \delta_i z_i \ln f_z(t_i;\theta) + (1-\delta_i) z_i \ln S_z(t_i;\theta) - \tau_i z_i \ln S_z(L_i;\theta)\Big)$
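As a sketch, this complete-data log-likelihood could be coded as follows (again with an exponential stand-in for $f_z$ and $S_z$; during EM the expected $z_i$ from the E-step is plugged in for $z$):

```python
import numpy as np

def complete_loglik(theta, p, t, delta, L, tau, z):
    """Complete-data log-likelihood from the expression above.
    theta is the rate of an assumed exponential component for z = 1;
    z may be the expected latent indicator from the E-step."""
    lam = theta
    log_fz = np.log(lam) - lam * t      # ln f_z(t)
    log_Sz_t = -lam * t                 # ln S_z(t)
    log_Sz_L = -lam * L                 # ln S_z(L)
    ll = (z * np.log(p) + (1 - z) * np.log(1 - p)
          - tau * (z * np.log(p) + (1 - z) * np.log(1 - p))
          + delta * z * log_fz
          + (1 - delta) * z * log_Sz_t
          - tau * z * log_Sz_L)
    return ll.sum()
```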

For the M-step, this function is maximized, although not in its entirety by a single maximization routine. Instead we note that it separates into two parts, $l(\theta,p; \cdot) = l_1(\theta,\cdot) + l_2(p,\cdot)$, which can be maximized independently, as sketched below.
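Concretely, for the log-likelihood above the $p$-part reduces to $\sum_i (1-\tau_i)\big(z_i\ln p + (1-z_i)\ln(1-p)\big)$, which has the closed-form maximizer $\hat p = \sum_i(1-\tau_i)\hat z_i \big/ \sum_i(1-\tau_i)$, while the $\theta$-part can be handed to a numerical optimizer. A sketch of such an M-step (with the same illustrative exponential component; not my actual parametrization):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def m_step(t, delta, L, tau, z_hat):
    """Maximize the two separable pieces of the complete-data log-likelihood."""
    # l_2(p) = sum_i (1 - tau_i) [ z_i ln p + (1 - z_i) ln(1 - p) ]
    w = 1 - tau
    p_new = (w * z_hat).sum() / w.sum()

    # l_1(theta), here with theta the rate of an assumed exponential component
    def neg_l1(lam):
        log_fz = np.log(lam) - lam * t
        log_Sz_t = -lam * t
        log_Sz_L = -lam * L
        return -(delta * z_hat * log_fz
                 + (1 - delta) * z_hat * log_Sz_t
                 - tau * z_hat * log_Sz_L).sum()

    res = minimize_scalar(neg_l1, bounds=(1e-8, 1e3), method="bounded")
    return res.x, p_new
```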

For the (k+1):th E-step, we must find the expected value of the (partially) unobserved latent variables $z_i$. We use the fact that for $\delta=1$ we have $z=1$.

$E(z_i|\mathbf{x_i},\theta^{(k)},p^{(k)}) = \delta_i + (1-\delta_i) P(z_i=1;\theta^{(k)},p^{(k)}|\mathbf{x_i})$

Here, by Bayes' theorem, we have $P(z_i=1;\theta^{(k)},p^{(k)}|\mathbf{x_i}) =\frac{P(\mathbf{x_i};\theta^{(k)},p^{(k)}|z_i=1)P(z_i=1;\theta^{(k)},p^{(k)})}{P(\mathbf{x_i};\theta^{(k)},p^{(k)})},$

which gives us $P(z_i=1;\theta^{(k)},p^{(k)}|\mathbf{x_i})=\frac{pS_z(t_i;\theta^{(k)})}{1 - p + pS_z(t_i;\theta^{(k)})}.$

(Note that here $\delta_i=0$, so no event is observed; thus the probability of the data $\mathbf{x_i}$ is given by the tail distribution function.)
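In code, the E-step is then a one-liner once $S_z$ is available (illustrative exponential component again):

```python
import numpy as np

def e_step(t, delta, theta, p):
    """E[z_i | x_i, theta, p]: 1 if an event was observed, otherwise the
    posterior probability p S_z(t) / (1 - p + p S_z(t))."""
    Sz = np.exp(-theta * t)     # S_z(t) for an assumed exponential component
    post = p * Sz / (1.0 - p + p * Sz)
    return delta + (1 - delta) * post
```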

Best Answer

The objective of EM is to maximize the observed data log-likelihood,

$$l(\theta) = \sum_i \ln \left[ \sum_{z} p(x_i, z| \theta) \right] $$

Unfortunately, this tends to be difficult to optimize with respect to $\theta$. Instead, EM repeatedly forms and maximizes the auxiliary function

$$Q(\theta , \theta^t) = \mathbb{E}_{z|\theta^t} \left (\sum_i \ln p(x_i, z_i| \theta) \right)$$

If $\theta^{t+1}$ maximizes $Q(\theta, \theta^t)$, EM guarantees that

$$l(\theta^{t+1}) \geq Q(\theta^{t+1}, \theta^t) \geq Q(\theta^t, \theta^t) = l(\theta^t)$$

If you'd like to know exactly why this is the case, Section 11.4.7 of Murphy's Machine Learning: A Probabilistic Perspective gives a good explanation. If your implementation doesn't satisfy these inequalities, you've made a mistake somewhere. Saying things like

I have a close to perfect fit, indicating there are no programming errors

is dangerous. With a lot of optimization and learning algorithms, it's very easy to make mistakes yet still get correct-looking answers most of the time. An intuition I'm fond of is that these algorithms are intended to deal with messy data, so it's not surprising that they also deal well with bugs!
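One practical way to turn the guarantee above into a debugging aid is to recompute the observed-data log-likelihood after every M-step and assert that it never decreases (equivalently, that the negative log-likelihood never increases). Here is a minimal sketch, written so that it takes your own E-step, M-step, and observed-data log-likelihood routines as arguments (the function names and signatures below are placeholders matching the sketches in the question, not a prescribed interface):

```python
def run_em_with_check(t, delta, L, tau, theta, p,
                      e_step, m_step, observed_loglik,
                      n_iter=200, tol=1e-10):
    """EM loop that recomputes the observed-data log-likelihood after every
    M-step and asserts that it never decreases (up to numerical noise)."""
    ll_old = observed_loglik(theta, p, t, delta, L, tau)
    history = [ll_old]
    for _ in range(n_iter):
        z_hat = e_step(t, delta, theta, p)           # E-step
        theta, p = m_step(t, delta, L, tau, z_hat)   # M-step
        ll_new = observed_loglik(theta, p, t, delta, L, tau)
        assert ll_new >= ll_old - 1e-8, "log-likelihood decreased: bug in E- or M-step"
        history.append(ll_new)
        if ll_new - ll_old < tol:
            break
        ll_old = ll_new
    return theta, p, history
```

The iteration at which the assertion first fires usually points you to the step that violates the EM inequality.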


On to the other half of your question,

is there a conventional search heuristic or likewise to increase the likelihood of finding the global minimum (or maximum)

Random restarts is the easiest approach; next easiest is probably simulated annealing over the initial parameters. I've also heard of a variant of EM called deterministic annealing, but I haven't used it personally so can't tell you much about it.
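For example, a minimal random-restart wrapper might look like the sketch below. It assumes a `run_em` function that returns the final parameters together with the observed-data log-likelihood (such as a wrapper around the check above), and the starting ranges are placeholders you would adapt to your model. Comparing runs by the observed-data log-likelihood (not the complete-data objective) is what makes the comparison across starts meaningful.

```python
import numpy as np

rng = np.random.default_rng(42)

def em_with_restarts(run_em, t, delta, L, tau, n_starts=20):
    """Run EM from several random starting points and keep the solution with
    the highest observed-data log-likelihood. run_em is any function with the
    signature run_em(t, delta, L, tau, theta0, p0) -> (theta, p, loglik)."""
    best = None
    for _ in range(n_starts):
        theta0 = rng.uniform(0.01, 1.0)    # illustrative range for the rate parameter
        p0 = rng.uniform(0.05, 0.95)       # illustrative range for the mixture weight
        theta, p, ll = run_em(t, delta, L, tau, theta0, p0)
        if best is None or ll > best[2]:
            best = (theta, p, ll)
    return best
```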
