Solved – Derivation of M-step in EM algorithm for mixture of Gaussians

Tags: expectation-maximization, gaussian-mixture-distribution

I am trying to derive the parameter estimation equations for the M-step of the expectation-maximization (EM) algorithm for a mixture of Gaussians in which all components share the same covariance matrix $\mathbf{\Sigma}$.

Pattern Recognition and Machine Learning by Bishop has a section on EM for Gaussian mixtures, and it includes a derivation of the M-step when all $K$ Gaussians have different covariance matrices $\mathbf{\Sigma_k}$. I think that if I can understand this derivation well, I can modify it to get what I want.

I understand the derivation given by Bishop for the M-step equation for $\mathbf{\mu}_k$. However, the book does not show detailed steps for the derivation of the M-step for $\mathbf{\Sigma}_k$. When I tried to derive it myself by computing $\frac{\partial L}{\partial \mathbf{\Sigma}_k}$ and setting it to $0$, I came across the following derivative that I don't know how to deal with:

$$
\frac{\partial}{\partial \mathbf{\Sigma}_k} \left( (2\pi)^{-d/2}\,|\mathbf{\Sigma}_k|^{-1/2}\,e^{-\frac{1}{2}(\mathbf{x}-\mathbf{\mu}_k)^T\mathbf{\Sigma}_k^{-1}(\mathbf{x}-\mathbf{\mu}_k)}\right)
$$

Basically, it's the derivative of the multivariate Gaussian PDF with respect to the covariance matrix. How do I compute this derivative? I've computed the derivative of the logarithm of this function before, when studying Gaussian Bayes classifiers, and it was much easier, which makes me suspect I've taken a wrong turn somewhere.

Best Answer

I've found the answer and I'm posting it for posterity. I mentioned in the question that computing the derivative of the logarithm of the PDF was easier. It turns out that this can be used to compute the derivative of the PDF itself:

$$ \frac{\partial \ln (f)}{\partial \mathbf{\Sigma}_k} = \frac{1}{f} \frac{\partial f}{\partial \mathbf{\Sigma}_k}\\ \Rightarrow \frac{\partial f}{\partial \mathbf{\Sigma}_k} = f \cdot\frac{\partial \ln (f)}{\partial \mathbf{\Sigma}_k} $$
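For completeness (this step isn't spelled out above), $\frac{\partial \ln(f)}{\partial \mathbf{\Sigma}_k}$ follows from two standard matrix-calculus identities, $\frac{\partial}{\partial \mathbf{\Sigma}} \ln|\mathbf{\Sigma}| = \mathbf{\Sigma}^{-1}$ and $\frac{\partial}{\partial \mathbf{\Sigma}}\, \mathbf{d}^T\mathbf{\Sigma}^{-1}\mathbf{d} = -\mathbf{\Sigma}^{-1}\mathbf{d}\mathbf{d}^T\mathbf{\Sigma}^{-1}$, evaluated at symmetric $\mathbf{\Sigma}$ (ignoring the usual symmetry correction, as is common). Writing $\mathbf{d} = \mathbf{x}-\mathbf{\mu}_k$:

$$
\begin{aligned}
\ln f &= -\tfrac{d}{2}\ln(2\pi) - \tfrac{1}{2}\ln|\mathbf{\Sigma}_k| - \tfrac{1}{2}\,\mathbf{d}^T\mathbf{\Sigma}_k^{-1}\mathbf{d} \\
\frac{\partial \ln f}{\partial \mathbf{\Sigma}_k} &= -\tfrac{1}{2}\mathbf{\Sigma}_k^{-1} + \tfrac{1}{2}\,\mathbf{\Sigma}_k^{-1}\mathbf{d}\mathbf{d}^T\mathbf{\Sigma}_k^{-1}
\end{aligned}
$$

Multiplying by $f$, as above, then gives the derivative of the PDF itself.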

Also, it turns out that differentiating the PDF with respect to the precision matrix $\mathbf{\Sigma}^{-1}$ instead of $\mathbf{\Sigma}$ is easier and leads to the same answer.
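As an illustrative sanity check (not part of the original answer), the closed-form gradient $\frac{\partial \ln f}{\partial \mathbf{\Sigma}} = -\frac{1}{2}\left(\mathbf{\Sigma}^{-1} - \mathbf{\Sigma}^{-1}\mathbf{d}\mathbf{d}^T\mathbf{\Sigma}^{-1}\right)$ with $\mathbf{d} = \mathbf{x}-\mathbf{\mu}$ can be compared against finite differences in NumPy; all names and values here are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# Illustrative test point, mean, and covariance (arbitrary values).
x = rng.standard_normal(d)
mu = rng.standard_normal(d)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)  # symmetric positive definite

def log_pdf(S):
    """log N(x | mu, S); slogdet/solve also handle the slightly
    non-symmetric matrices produced by entrywise perturbation."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(S, diff))

# Closed-form gradient of ln f w.r.t. Sigma (entrywise convention).
diff = x - mu
Sinv = np.linalg.inv(Sigma)
grad_closed = -0.5 * (Sinv - Sinv @ np.outer(diff, diff) @ Sinv)

# Central finite differences, perturbing each entry independently.
eps = 1e-6
grad_fd = np.zeros_like(Sigma)
for i in range(d):
    for j in range(d):
        Sp, Sm = Sigma.copy(), Sigma.copy()
        Sp[i, j] += eps
        Sm[i, j] -= eps
        grad_fd[i, j] = (log_pdf(Sp) - log_pdf(Sm)) / (2 * eps)

print(np.abs(grad_closed - grad_fd).max())  # should be tiny
```

Multiplying `grad_closed` by `np.exp(log_pdf(Sigma))` gives the derivative of the PDF itself, matching the product-rule identity above.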