Maximum Likelihood – Multimodality in Gaussian Mixture Models with Partially Observed Y

Tags: expectation-maximization, gaussian mixture distribution, maximum likelihood, semi-supervised-learning

In discussing Gaussian mixture models (GMMs), the post at https://normaldeviate.wordpress.com/2012/08/04/mixture-models-the-twilight-zone-of-statistics/ highlights the issue of:

Multimodality of the Likelihood. In fact, the likelihood has many modes. Finding the global (but not infinite) mode is a nightmare. The EM [expectation-maximization] algorithm only finds local modes. In a sense, the MLE is not really a well-defined estimator because we can’t really find it.

Is the multimodality of the likelihood "fixed" (does the likelihood become uni-modal) when class labels $Y$ are partially observed?

By "class labels $Y$", I mean $Y \in \left\{0, 1, 2, \ldots, k-1\right\}$ indicating the component from which the observation was drawn, where the number of components $k$ is known ahead of time. I'm assuming a data generating process in which $Y$ is missing completely at random — for example, we might observe $Y$ for a random 5% of our data, with $Y$ missing for the remaining 95%, and missingness occurring independently of $X$.

Here is an image illustrating what I have in mind:

[Figure: GMM data with $Y$ MCAR]

The labels $Y$ for all points are shown below:

[Figure: GMM data with $Y$ fully observed]

If we ran the expectation-maximization (EM) algorithm on the dataset where $Y$ is partially observed, would we find that the likelihood is unimodal? If so, how many $Y$s would need to be observed for that to hold? Would observing at least one $Y$ from each of the $k$ classes/components be sufficient? Or would observing at least one $Y$ for only $k-1$ of the components suffice?
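For reference, here is one sketch (in 1-D, with assumed names throughout, not code from any particular source) of how EM is typically adapted to partially observed labels: points with an observed $Y$ get hard, fixed responsibilities in the E-step, while unlabeled points get the usual soft posteriors.

```python
import numpy as np
from scipy.stats import norm

def em_partial_labels(x, y_partial, k, n_iter=200, seed=0):
    """x: (n,) data; y_partial: (n,) labels in {0..k-1}, or -1 if unobserved."""
    rng = np.random.default_rng(seed)
    n = len(x)
    labeled = y_partial >= 0
    mu = rng.choice(x, size=k, replace=False)   # random initialization
    sigma = np.full(k, x.std())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: soft responsibilities r[i, j] = P(Y_i = j | x_i) ...
        r = w * norm.pdf(x[:, None], mu, sigma)
        r /= r.sum(axis=1, keepdims=True)
        # ... but hard, fixed responsibilities where Y is observed.
        r[labeled] = 0.0
        r[labeled, y_partial[labeled]] = 1.0
        # M-step: responsibility-weighted MLEs.
        nj = r.sum(axis=0)
        w = nj / n
        mu = (r * x[:, None]).sum(axis=0) / nj
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nj)
    # Observed-data log-likelihood: mixture density for unlabeled points,
    # joint density w_y * f(x | mu_y, sigma_y) for labeled ones.
    yl = y_partial[labeled]
    ll = np.log((w * norm.pdf(x[~labeled, None], mu, sigma)).sum(axis=1)).sum()
    ll += (np.log(w[yl]) + norm.logpdf(x[labeled], mu[yl], sigma[yl])).sum()
    return mu, sigma, w, ll
```

With every label missing this reduces to ordinary EM for a GMM; with every label observed the E-step is trivial and a single M-step recovers the closed-form fit discussed in the edit below. Running it from many random seeds and comparing the converged log-likelihoods is one empirical way to probe for multiple modes.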

Edit: when $Y$ is fully observed (as in the second graph, immediately above), we can estimate the mean and variance-covariance matrix of each of the $k$ components separately; in that case the likelihood has a single maximum and a relatively simple closed-form solution (see Maximum Likelihood Estimators – Multivariate Gaussian). So my question is essentially about when the likelihood becomes multimodal as the probability of $Y$ being missing increases from $p=0$ through $p \in \left(0, 1\right)$ to $p=1$.
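A minimal sketch of that closed-form fit (function name assumed): with $Y$ fully observed, the MLE decouples across components into per-component sample means, (biased) sample covariances, and label frequencies as mixture weights.

```python
import numpy as np

def gmm_mle_fully_observed(x, y, k):
    """x: (n, d) data; y: (n,) labels in {0, ..., k-1}."""
    params = []
    for j in range(k):
        xj = x[y == j]
        mu_j = xj.mean(axis=0)              # component mean MLE
        diff = xj - mu_j
        sigma_j = diff.T @ diff / len(xj)   # MLE covariance uses 1/n_j
        w_j = len(xj) / len(x)              # mixture weight = label frequency
        params.append((mu_j, sigma_j, w_j))
    return params
```

Note the $1/n_j$ factor: the MLE of the covariance is the biased estimator, whereas np.cov defaults to the $1/(n_j - 1)$ version.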

Best Answer

As you pointed out, the likelihood of the labeled data is just a product of multivariate Gaussian likelihoods for the individual components, so it is easy to analyze what happens when you add one unlabeled data point. Consider for simplicity the case of two components with only the means unknown; the total likelihood then has the form:

$$ \mathcal L(\mu_1,\mu_2) = f(\mu_1 | \hat \mu_1 , \Sigma_1) f(\mu_2 | \hat \mu_2 , \Sigma_2) \times ( w_1 f(x|\mu_1,\sigma) + w_2 f(x|\mu_2,\sigma) )$$

$$ = \tilde w_1 f(\mu_1 | \hat \mu_1 + \delta_1, \tilde \Sigma_1)f(\mu_2 | \hat \mu_2 , \Sigma_2) + \tilde w_2 f(\mu_2 | \hat \mu_2 + \delta_2, \tilde \Sigma_2)f(\mu_1 | \hat \mu_1 , \Sigma_1)$$

where $f(\cdot \mid \cdot)$ here denotes the Gaussian pdf, $\hat \mu_1, \hat \mu_2$ are the MLEs from the labeled data, and $x$ is the unlabeled data point. (Viewed as functions of the means, the labeled-data likelihoods are Gaussian, centered at the MLEs with covariances $\Sigma_1, \Sigma_2$.)

As you can see, the likelihood itself becomes a sum (mixture) of two Gaussian components, one with means $(\hat \mu_1 + \delta_1,\hat \mu_2)$ and the other with $(\hat \mu_1 ,\hat \mu_2 + \delta_2)$, where $\delta_1,\delta_2$ represent the amounts by which the MLEs are shifted by the new data point.

The sum of two Gaussians has two modes if they are far enough apart, and this depends on how far the point $x$ is from the means $\hat \mu_1,\hat \mu_2$, as well as on the covariances. Recall that each new mean is a weighted average of the old mean and the new point, weighted by the inverse covariances. The more labeled data points there are, the smaller the covariances $\Sigma_1,\Sigma_2$, and therefore the smaller the shifts $\delta_1,\delta_2$. However, you can always find a point $x$ far enough from the means $\hat \mu_1,\hat \mu_2$ (and equally far from both of them) that the separation is large enough and the likelihood has two modes.
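To make the shrinking shift concrete (a worked special case, not part of the original argument): in one dimension with common known variance $\sigma^2$ and $n_1$ labeled points in component 1, the labeled-data likelihood for $\mu_1$ is Gaussian with $\Sigma_1 = \sigma^2/n_1$, and completing the square in the first term of the mixture above gives

$$ \hat \mu_1 + \delta_1 = \frac{n_1 \hat \mu_1 + x}{n_1 + 1}, \qquad \delta_1 = \frac{x - \hat \mu_1}{n_1 + 1}, \qquad \tilde \Sigma_1 = \frac{\sigma^2}{n_1 + 1}, $$

so the shift decays like $1/(n_1+1)$, and $x$ must sit farther out as the labeled sample grows in order to keep the two modes separated.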

So theoretically the likelihood can be multimodal even with a single unlabeled point and any number of labeled points. Of course, this becomes increasingly unlikely as the number of labeled points grows, since the new point would have to be increasingly far from the means of the labeled data.
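As a quick numerical check of this argument (a sketch; all specific values are assumed for illustration): take two 1-D components with known unit variance, two labeled points per component with means near $0$ and $10$, and one unlabeled point at $x = 5$, equally far from both. Evaluating $\mathcal L(\mu_1, \mu_2)$ on a grid reveals two separated peaks.

```python
import numpy as np
from scipy.stats import norm
from scipy.ndimage import maximum_filter

x1 = np.array([-0.1, 0.1])         # labeled data from component 1
x2 = np.array([9.9, 10.1])         # labeled data from component 2
x_new, w1, w2 = 5.0, 0.5, 0.5      # unlabeled point, equidistant from both means

def log_lik(mu1, mu2):
    # Labeled-data term: product of Gaussian likelihoods per component.
    ll = norm.logpdf(x1, mu1, 1).sum() + norm.logpdf(x2, mu2, 1).sum()
    # Unlabeled-data term: the two-component mixture density at x_new.
    return ll + np.log(w1 * norm.pdf(x_new, mu1, 1)
                       + w2 * norm.pdf(x_new, mu2, 1))

g1 = np.linspace(-2.0, 4.0, 121)
g2 = np.linspace(6.0, 12.0, 121)
surface = np.array([[log_lik(m1, m2) for m2 in g2] for m1 in g1])

# Flag well-separated local maxima of the log-likelihood surface.
is_peak = (surface == maximum_filter(surface, size=5)) \
          & (surface > surface.max() - 10)
for i, j in np.argwhere(is_peak):
    print(f"local mode near mu1 = {g1[i]:.2f}, mu2 = {g2[j]:.2f}")
```

Two modes appear, matching the formula above: one near $(\mu_1, \mu_2) \approx (1.67, 10)$, where $x$ pulls $\hat\mu_1$, and its mirror near $(0, 8.33)$, where $x$ pulls $\hat\mu_2$.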
