Expectation Maximization – Convergence for Mixture of Gaussians

convergence, expectation-maximization, gaussian mixture distribution, normal distribution

Is the mixture of Gaussians model (an example of latent class analysis) guaranteed to converge on a viable solution, even on unimodal data, when the Expectation Maximization algorithm is used to estimate the classes?

EM is guaranteed to find a local optimum, but what does that actually mean for the parameter estimates?

For example, if I have (univariate or multivariate) Gaussian data and I try to fit a mixture of two Gaussians to it, what is theoretically supposed to happen?

Best Answer

Assuming no restrictions on the parameters $\mu_k$ and $\sigma_k$ (the means and standard deviations of the individual mixture components), the EM algorithm may not converge to a local maximum, because the likelihood function is unbounded in this case.

To illustrate, suppose we observe data points $x_1, ..., x_n$ and we fit a two component Gaussian mixture model. Then the log likelihood can be written as

$ \displaystyle \sum_{i = 1}^n \log\left( p_1 \times f(x_i | \mu_1, \sigma_1) + p_2 \times f(x_i|\mu_2, \sigma_2)\right)$

where $f(x|\mu, \sigma)$ is the pdf of the normal distribution with mean $\mu$ and standard deviation $\sigma$. Now consider the case where we set $\mu_1 = \bar x$ (the mean of the observed data) and $\sigma_1 = \hat \sigma$ (the standard deviation of the observed data), with $p_1 = p_2 = 0.5$ (side note: these values are somewhat arbitrary and are used only to establish a lower bound). Then the contribution of each point to the log-likelihood will be at least

$\log(0.5 \times f(x_i|\bar x, \hat \sigma) )$

i.e. each observation's contribution is bounded from below, regardless of the values of $\mu_2$ and $\sigma_2$. Now suppose we set $\mu_2 = x_1$. As $\sigma_2 \rightarrow 0$, $f(x_1 | x_1, \sigma_2) \rightarrow \infty$. Of course, this implies

$\log(0.5 \times f(x_1| x_1, \sigma_2) + 0.5 \times f(x_1 | \bar x, \hat \sigma)) \rightarrow \infty$

as well. Since the contributions of all the other observations are bounded from below by $\log(0.5 \times f(x_i|\bar x, \hat \sigma) )$, this implies that the likelihood function is unbounded.
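To see this numerically, here is a minimal sketch (my own illustration, not part of the argument above) that evaluates the two-component log-likelihood on simulated unimodal data with $\mu_1 = \bar x$, $\sigma_1 = \hat \sigma$, $p_1 = p_2 = 0.5$, $\mu_2 = x_1$, and $\sigma_2$ shrinking toward zero; the simulated data and the particular values of $\sigma_2$ are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=100)          # unimodal data from a single Gaussian
xbar, shat = x.mean(), x.std()    # mean and sd of the observed data

# Pin mu_2 at the first observation and shrink sigma_2 toward zero.
for sigma2 in [1.0, 1e-2, 1e-4, 1e-8]:
    dens = 0.5 * norm.pdf(x, xbar, shat) + 0.5 * norm.pdf(x, x[0], sigma2)
    print(f"sigma_2 = {sigma2:.0e}  ->  log-likelihood = {np.log(dens).sum():.2f}")

# The printed log-likelihood grows without bound as sigma_2 -> 0,
# matching the argument above.
```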

Since the likelihood function is unbounded, if the EM algorithm approaches this infinite "peak", it will not converge. However, the likelihood typically does have non-degenerate local maxima, provided $n$ is sufficiently large relative to $k$ (the number of components). So the EM algorithm may find one of these non-degenerate local maxima, which are undoubtedly better estimators than the degenerate solution. In fact, when $n$ is large relative to $k$, it is much more likely to find a non-degenerate local maximum than to approach the degenerate solution.
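As a rough practical illustration (not something the theory above guarantees), one can fit a two-component mixture to unimodal data with scikit-learn's GaussianMixture; note that this implementation adds a small reg_covar term to each covariance by default, which rules out the degenerate zero-variance spike regardless.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))    # unimodal data from a single Gaussian

# Fit a two-component mixture with EM.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("weights:  ", gmm.weights_)
print("means:    ", gmm.means_.ravel())
print("variances:", gmm.covariances_.ravel())

# Typically both components end up near the true mean with variance close to 1
# (possibly with one component down-weighted), i.e. a non-degenerate local maximum.
```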