Solved – Derivation of maximum likelihood for a Gaussian mixture model

expectation-maximization, gaussian-mixture-distribution, maximum-likelihood, normal-distribution, probability

I'm working my way through the derivation of EM in Bishop (p. 435).

I'm stuck trying to derive the MLE for $\mu_k$ for the Gaussian mixture model.

Basically I get an extra sum in the numerator.

For those that don't have the book:

The log likelihood for the Gaussian mixture model is:

$$ \ln p(X|\pi,\mu,\Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^K \pi_k N(x_n|\mu_k,\Sigma_k) \right\} $$
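As a sanity check, this log likelihood is easy to evaluate numerically. The sketch below (names like `gmm_log_likelihood` are my own, not from the book) computes it directly from the formula above using plain numpy:

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """ln p(X | pi, mu, Sigma) = sum_n ln { sum_k pi_k N(x_n | mu_k, Sigma_k) }."""
    return sum(
        np.log(sum(p * gauss_pdf(x, m, S) for p, m, S in zip(pis, mus, Sigmas)))
        for x in X
    )
```

With a single component ($K=1$, $\pi_1=1$) this reduces to the ordinary Gaussian log likelihood, which gives a quick way to test the implementation.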

When I take the derivative with respect to $\mu_k$:

  1. Recognise that we're dealing with $\ln f(x)$, whose derivative is $\frac{f'(x)}{f(x)}$

  2. This gives us:

$$ \sum_{n=1}^{N} \frac{1}{\sum_{k=1}^K \pi_k N(x_n|\mu_k,\Sigma_k)} \times \frac{\partial \sum_{k=1}^K \pi_k N(x_n|\mu_k,\Sigma_k) }{\partial \mu_k} $$

  3. Now we only have to solve the derivative in the right-most term:

$$ \frac{ \partial \sum_{k=1}^K \pi_k N(x_n|\mu_k,\Sigma_k) }{\partial \mu_k} = \sum_{k=1}^K -0.5(2\Sigma_k^{-1}(x_n-\mu_k))\times \pi_k N(x_n|\mu_k,\Sigma_k) $$

  4. This leaves me with:

$$ \sum_{n=1}^{N} \frac{ \sum_{k=1}^K \pi_k N(x_n|\mu_k,\Sigma_k) }{\sum_{k=1}^K \pi_k N(x_n|\mu_k,\Sigma_k)} \times -0.5(2\Sigma_k^{-1}(x_n-\mu_k)) $$

  5. The solution in the book is:

$$ \sum_{n=1}^{N} \frac{ \pi_k N(x_n|\mu_k,\Sigma_k) }{\sum_{j} \pi_j N(x_n|\mu_j,\Sigma_j)} \times 0.5(2\Sigma_k^{-1}(x_n-\mu_k)) $$

How is it that

  1. There's no summation in their numerator?

  2. They've changed the summation index from $k$ to $j$?

  3. They have a positive final term, whereas I have a negative?

Thanks

Best Answer

To avoid any confusion, the summation index and the index of the $\mu$ you differentiate with respect to should be different. From the beginning, write the likelihood with summation index $j$ and differentiate with respect to $\mu_k$:

$$\frac{\partial \sum_{j=1}^K \pi_j N(x_n|\mu_j,\Sigma_j) }{\partial \mu_k}= \pi_k\frac{\partial N(x_n|\mu_k,\Sigma_k)}{\partial \mu_k}$$ since only the $j=k$ term depends on $\mu_k$. This explains why the answer doesn't have a summation in the numerator.
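Carrying this through the chain rule, the derivative of the log likelihood becomes (in Bishop's notation, where the ratio is the responsibility $\gamma(z_{nk})$):

$$\frac{\partial \ln p(X|\pi,\mu,\Sigma)}{\partial \mu_k} = \sum_{n=1}^{N} \underbrace{\frac{\pi_k N(x_n|\mu_k,\Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_n|\mu_j,\Sigma_j)}}_{\gamma(z_{nk})} \, \Sigma_k^{-1}(x_n-\mu_k)$$

Setting this to zero and solving gives the familiar M-step update $\mu_k = \frac{1}{N_k}\sum_{n=1}^N \gamma(z_{nk})\, x_n$ with $N_k = \sum_n \gamma(z_{nk})$.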

You'll get one minus from differentiating $(x_n-\mu_k)$ with respect to $\mu_k$ (which gives $-1$), and another minus from the $\exp(-(\ldots))$ expression in the normal PDF. The two cancel each other out, which is why the book's final term is positive.
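If the sign still feels suspicious, you can check the analytic gradient against finite differences. This is a quick numerical sketch (not from the book; helper names are my own) comparing $\sum_n \gamma(z_{nk})\,\Sigma_k^{-1}(x_n-\mu_k)$, with the positive sign, to a central-difference gradient of the log likelihood:

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def log_likelihood(X, pis, mus, Sigmas):
    return sum(
        np.log(sum(p * gauss_pdf(x, m, S) for p, m, S in zip(pis, mus, Sigmas)))
        for x in X
    )

def analytic_grad_mu(X, pis, mus, Sigmas, k):
    # sum_n gamma(z_nk) * Sigma_k^{-1} (x_n - mu_k) -- note the POSITIVE sign
    grad = np.zeros_like(mus[k])
    for x in X:
        dens = sum(p * gauss_pdf(x, m, S) for p, m, S in zip(pis, mus, Sigmas))
        gamma = pis[k] * gauss_pdf(x, mus[k], Sigmas[k]) / dens
        grad += gamma * np.linalg.solve(Sigmas[k], x - mus[k])
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
pis = [0.4, 0.6]
mus = [np.array([0.0, 0.0]), np.array([1.0, -1.0])]
Sigmas = [np.eye(2), 2 * np.eye(2)]

# central-difference gradient with respect to mu_0
k, eps = 0, 1e-6
num_grad = np.zeros(2)
for i in range(2):
    mp = [m.copy() for m in mus]; mm = [m.copy() for m in mus]
    mp[k][i] += eps; mm[k][i] -= eps
    num_grad[i] = (log_likelihood(X, pis, mp, Sigmas)
                   - log_likelihood(X, pis, mm, Sigmas)) / (2 * eps)

print(np.allclose(num_grad, analytic_grad_mu(X, pis, mus, Sigmas, k), atol=1e-5))
```

If the analytic expression had the wrong sign, the finite-difference check would fail immediately, so this is a handy way to debug hand-derived gradients.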