Solved – When does EM for a Gaussian mixture model have one component collapse onto a single data point with zero variance

clustering, expectation-maximization, gaussian mixture distribution, machine learning, mixed model

I have implemented the EM algorithm for Gaussian mixture models as follows:

For the E-step I compute the soft counts, i.e. the posterior probability of assigning each data point $x^{(t)}$, $t = 1, \dots, n$, to each cluster $j \in \{1, \dots, K\}$:

$$P(j \mid x^{(t)}) = \frac{\hat p(j)\, N( x^{(t)} ; \hat \mu^{(j)} , \hat \sigma^2_j I)}{\sum^{K}_{k=1} \hat p(k)\, N( x^{(t)} ; \hat \mu^{(k)}, \hat \sigma^2_k I ) } $$
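In code, this step looks roughly like the following NumPy sketch (names such as `e_step`, `pi`, `mu`, `var` are just illustrative placeholders, not my exact implementation; the posteriors are computed in log-space for numerical stability):

```python
import numpy as np

def e_step(X, pi, mu, var):
    """Posteriors P(j | x^(t)) for spherical Gaussians N(mu^(j), var_j * I).

    X: (n, d) data, pi: (K,) mixing weights, mu: (K, d) means, var: (K,) variances.
    """
    d = X.shape[1]
    # Squared distances ||x^(t) - mu^(j)||^2, shape (n, K)
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # log of p_hat(j) * N(x^(t); mu^(j), var_j * I)
    log_w = np.log(pi) - 0.5 * d * np.log(2 * np.pi * var) - 0.5 * sq_dist / var
    # Normalise over components j (a numerically stable softmax per data point)
    log_w -= log_w.max(axis=1, keepdims=True)
    post = np.exp(log_w)
    post /= post.sum(axis=1, keepdims=True)
    return post          # shape (n, K), each row sums to 1
```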

For the M-step we re-compute the parameters of our model given the fixed posterior (i.e. given the soft assignments of each point to each cluster):

$$ \hat n_j = \sum^{n}_{t=1} P(j \mid x^{(t)}), \qquad \hat p(j) = \frac{\hat n_j}{n}$$

$$ \hat \mu^{(j)} = \frac{1}{\hat n_j} \sum^{n}_{t=1} P(j \mid x^{(t)})\, x^{(t)}$$

$$ \hat \sigma^2_{j} = \frac{1}{d \, \hat n_j} \sum^{n}_{t=1} P(j \mid x^{(t)})\, \| x^{(t)} - \hat \mu^{(j)} \|^2$$

where $\hat n_j$ is the effective number of points assigned to cluster $j$ and $d$ is the dimension of the data.
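And the corresponding M-step, given the $(n, K)$ matrix of posteriors returned above (again only a sketch with placeholder names, assuming NumPy arrays):

```python
def m_step(X, post):
    """Re-estimate (pi, mu, var) from soft assignments post[t, j] = P(j | x^(t))."""
    n, d = X.shape
    n_hat = post.sum(axis=0)                          # effective counts n_hat_j
    pi = n_hat / n                                    # mixing weights p_hat(j)
    mu = (post.T @ X) / n_hat[:, None]                # weighted means mu_hat^(j)
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    var = (post * sq_dist).sum(axis=0) / (d * n_hat)  # spherical variances sigma_hat_j^2
    return pi, mu, var
```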

Assuming the above algorithm is implemented correctly, when exactly does one of the mixture components converge so that its mean equals some data point $x^{(t)}$ and its variance converges to zero, i.e. $\hat \sigma^2_j \to 0$? Does there exist a data set for which this scenario is possible, or is it impossible for a correct implementation? My intuition is that, since a Gaussian density is non-zero everywhere (except in the degenerate zero-variance limit, where it becomes a spike) and every data point keeps a soft assignment to every cluster, this scenario should not be theoretically possible for a correct implementation of EM. Am I correct, or can it actually happen in theory (and in practice)?

Best Answer

The EM algorithm is really just an iterative approximation to true Maximum Likelihood Estimation. Even if implemented correctly,

  • as per Xi'an's comment, it is sensitive to starting conditions, though this can be mitigated by constraining the variances of the Gaussians (a variance floor) so that they cannot become arbitrarily narrow (see the sketch below);
  • it may only find a local maximum rather than a global one.

In practice, one often runs the EM procedure several times with different starting conditions and keeps the best solution, to avoid mistaking a local optimum for the global one.
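As a rough illustration of both remedies, here is a self-contained sketch (the name `fit_spherical_gmm` and the `var_floor` parameter are mine, not part of any particular library): it runs EM from several random initialisations, clamps each variance to a floor after every M-step, and keeps the run with the highest log-likelihood.

```python
import numpy as np

def fit_spherical_gmm(X, K, n_iter=100, n_restarts=10, var_floor=1e-6, seed=0):
    """EM for a spherical Gaussian mixture with a variance floor and random restarts."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best_ll, best_params = -np.inf, None
    for _ in range(n_restarts):
        # Random initialisation: means at distinct data points, uniform weights
        mu = X[rng.choice(n, size=K, replace=False)]
        pi = np.full(K, 1.0 / K)
        var = np.full(K, X.var())
        for _ in range(n_iter):
            # E-step (in log-space): responsibilities post[t, j] = P(j | x^(t))
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            log_w = np.log(pi) - 0.5 * d * np.log(2 * np.pi * var) - 0.5 * sq / var
            ll = np.logaddexp.reduce(log_w, axis=1).sum()   # log-likelihood before this M-step
            post = np.exp(log_w - log_w.max(axis=1, keepdims=True))
            post /= post.sum(axis=1, keepdims=True)
            # M-step, with the variances clamped to var_floor
            n_hat = post.sum(axis=0)
            pi = n_hat / n
            mu = (post.T @ X) / n_hat[:, None]
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            var = np.maximum((post * sq).sum(axis=0) / (d * n_hat), var_floor)
        if ll > best_ll:
            best_ll, best_params = ll, (pi, mu, var)
    return best_ll, best_params
```

Setting `var_floor` relative to the overall scale of the data (e.g. a small fraction of `X.var()`) is usually more sensible than a fixed constant.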

I think this is a good opportunity to highlight a passage by the late, great David MacKay, who uses this particular corner case to attack the principle of maximum likelihood estimation in general:


KABOOM!

Soft K-means can blow up. Put one cluster exactly on one data point and let its variance go to zero - you can obtain an arbitrarily large likelihood! Maximum likelihood methods can break down by finding highly tuned models that fit part of the data perfectly. This phenomenon is known as overfitting. The reason we are not interested in these solutions with enormous likelihood is this: sure, these parameter-settings may have enormous posterior probability density, but the density is large over only a very small volume of parameter space. So the probability mass associated with these likelihood spikes is usually tiny.

We conclude that maximum likelihood methods are not a satisfactory general solution to data-modelling problems: the likelihood may be infinitely large at certain parameter settings. Even if the likelihood does not have infinitely large spikes, the maximum of the likelihood is often unrepresentative, in high dimensional problems.

Even in low-dimensional problems, maximum likelihood solutions can be unrepresentative. As you may know from basic statistics, the maximum likelihood estimator (22.15) for a Gaussian's standard deviation is a biased estimator, a topic that we'll take up in Chapter 24.

(Note that what he calls "soft k-means" is just EM.)
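To see the mechanism concretely: if component $j$ has its mean sitting exactly on a data point, $\hat \mu^{(j)} = x^{(t)}$, then that point's likelihood contains the term

$$\hat p(j)\, N\big( x^{(t)} ; \hat \mu^{(j)}, \hat \sigma^2_j I \big) = \frac{\hat p(j)}{(2 \pi \hat \sigma^2_j)^{d/2}} \longrightarrow \infty \quad \text{as } \hat \sigma^2_j \to 0,$$

while every other data point still gets a non-zero contribution from the remaining components. The total likelihood is therefore unbounded, and EM, which never decreases the likelihood, can drift toward such a degenerate configuration when the soft assignments for one component concentrate on a single point. In exact arithmetic the collapse is only reached in the limit, but in floating point the variance can underflow to zero after finitely many iterations, which is exactly what the variance constraint mentioned above guards against.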
