Solved – Derivation of M step for Gaussian mixture model

calculus, clustering, expectation-maximization, gaussian-mixture-distribution, mathematical-statistics

Summary

So to summarize my question, how can I take
\begin{align}
= \sum_{i=1}^{n}W_{i1} \left(log (1-\sum_{j=2}^{K}\pi_j) -\frac{1}{2} log(|\Sigma_1|) -\frac{d}{2} log(2\pi) -\frac{1}{2}(x_i-\mu_1)^{T} \Sigma_{1}^{-1}(x_i-\mu_1) \right)+
\sum_{i=1}^{n}\sum_{j=2}^{K} W_{ij} \left( log(\pi_j) -\frac{1}{2} log (|\Sigma_j|) -\frac{d}{2} log(2\pi) -\frac{1}{2}(x_i-\mu_j)^{T} \Sigma_{j}^{-1}(x_i-\mu_j)\right)
\end{align}

and maximize it with respect to $\mu_{j}$ and $\Sigma_{j}$. I am having trouble with the calculus. Below I provide a longer derivation of the E step showing how I got to this point; it is not necessary to read it in order to answer my question.

EM algorithm background

The expectation-maximisation algorithm is an alternating (iterative) algorithm, where we start with an initial value for $\theta$, as we would in a gradient descent approach. In gradient descent we would move in the direction of the gradient many times in order to maximise the function. However, in this case we cannot do gradient descent directly, since $l(\theta|x,z)$ depends on the unobserved labels $z$, and therefore we have to alternate between an expectation step and a maximisation step (a short code sketch of this loop follows the two steps below):

  1. set $\theta_0$
  2. Alternate between:

\begin{align*}
& E: \text{Find an expression for} &\\
& E_z\left[l(\theta|X,Z)|X,\theta\right] &\\
& = \sum_{\text{all } z} l(\theta|x,z) P(Z=z|x,\theta)
\end{align*}

\begin{align*}
& M :\text{Maximise over $\theta$} &\\
& E_z \left[l (\theta|X,Z)| X,\theta \right] &\\
\end{align*}
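To make the alternation concrete, here is a minimal NumPy-style sketch of the loop. This is only an illustration; `em_gmm`, `e_step`, and `m_step` are placeholder names of my own, with the M step being exactly what this question asks about.

```python
import numpy as np

def em_gmm(X, K, e_step, m_step, n_iter=100, seed=0):
    """Sketch of the EM alternation for a K-component Gaussian mixture."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 1: set theta_0 (random means, identity covariances, uniform weights).
    mu = X[rng.choice(n, size=K, replace=False)]
    Sigma = np.stack([np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    # Step 2: alternate the E and M steps (fixed iteration count for simplicity).
    for _ in range(n_iter):
        W = e_step(X, pi, mu, Sigma)   # W[i, j] = P(Z_i = j | x_i, theta)
        pi, mu, Sigma = m_step(X, W)   # maximise E_Z[ l(theta | X, Z) | X, theta ]
    return pi, mu, Sigma
```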

We want to maximise the log-likelihood:
$l(\theta|x)$

Problem: Difficult to maximise it directly.

\begin{align*}
\theta & = \left\{\pi_1,\dots,\pi_k,\mu_1,\dots,\mu_k,\Sigma_1,\dots,\Sigma_k \right\} & \\
l(\theta|x) & = \sum_{i=1}^{n} log \left(\sum_{k=1}^{K} \pi_k \frac{1}{|\Sigma_k|^{1/2}} \frac{1}{(2\pi)^{d/2}} \operatorname{exp}\left(-\frac{1}{2}(x_i-\mu_k)^{T}\Sigma_{k}^{-1} (x_i-\mu_k)\right)\right) &\\
\end{align*}

It is difficult to maximise $l(\theta|x)$ directly because each of the $n$ terms contains a sum over components inside a log, so we perform an EM procedure, which leaves us with sums outside the log.
Let $Z$ be a vector of length $n$, with $Z_i$ being the identity of the component which generated $x_i$. Then,

\begin{align*}
l(\theta|X,Z) & = \sum_{i=1}^{n} log \left(\pi_{Z_i} \frac{1}{|\Sigma_{Z_i}|^{1/2}} \frac{1}{(2\pi)^{d/2}} \operatorname{exp}\left(-\frac{1}{2}(x_i-\mu_{Z_i})^{T}\Sigma_{Z_i}^{-1} (x_i-\mu_{Z_i})\right)\right)
\end{align*}

\begin{align*}
P(Z_i=j|X,\theta) & = \frac{P\left(X=x_i|\theta, Z_i =j \right) P\left(Z_i=j|\theta\right)}{\sum_{k=1}^{K}P\left(X=x_i|\theta, Z_i=k \right)P\left(Z_i=k|\theta\right)} &\\
& = \frac{\frac{1}{|\Sigma_j|^{1/2}} \frac{1}{(2\pi)^{d/2}} \operatorname{exp} \left(-\frac{1}{2}(x_i-\mu_j)^T\Sigma_{j}^{-1}(x_i-\mu_j)\right)\pi_j}{\sum_{k=1}^{K}\pi_k \frac{1}{|\Sigma_k|^{1/2}(2\pi)^{d/2}} \operatorname{exp} \left(-\frac{1}{2}(x_i-\mu_k)^{T}\Sigma_{k}^{-1}(x_i-\mu_k)\right)} &\\
& = w_{ij} &\\
\end{align*}
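As a side note, this $w_{ij}$ is straightforward to compute directly. A small sketch using scipy's multivariate normal density (my own illustration, not part of the derivation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, Sigma):
    """Responsibilities w_ij = P(Z_i = j | x_i, theta) for all i and j."""
    n, K = X.shape[0], len(pi)
    W = np.empty((n, K))
    for j in range(K):
        # Numerator: pi_j * N(x_i; mu_j, Sigma_j), evaluated for every x_i at once.
        W[:, j] = pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
    # Denominator: sum over k of pi_k * N(x_i; mu_k, Sigma_k), row by row.
    W /= W.sum(axis=1, keepdims=True)
    return W
```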

\begin{align*}
& E: E_Z \left[l(\theta | X, Z) |X,\theta \right] &\\
& = E_Z \left[\sum_{i=1}^{n} log \left(\pi_{Z_i} \frac{1}{|\Sigma_{Z_i}|^{1/2} (2\pi)^{d/2}} \operatorname{exp}\left(-\frac{1}{2}(x_i-\mu_{Z_i})^T\Sigma_{Z_i}^{-1}(x_i-\mu_{Z_i})\right)\right)\Big|X,\theta\right] &\\
& = \sum_{i=1}^{n} \sum_{j=1}^{K} P\left(Z_i=j|X,\theta\right) log \left(\pi_j \frac{1}{|\Sigma_j|^{1/2}(2\pi)^{d/2}} \operatorname{exp}\left(-\frac{1}{2}(x_i-\mu_j)^{T}\Sigma_j^{-1}(x_i-\mu_j)\right)\right) &\\
& = \sum_{i=1}^{n} \sum_{j=1}^{K} W_{ij} \left(log (\pi_j) -\frac{1}{2} log (|\Sigma_j|) -\frac{d}{2} log (2\pi) -\frac{1}{2}(x_i-\mu_j)^{T}\Sigma_j^{-1}(x_i-\mu_j)\right) &\\
& \text{set $\pi_1=1-\sum_{j=2}^{K}\pi_j$} &\\
& = \sum_{i=1}^{n}W_{i1} \left(log \left(1-\sum_{j=2}^{K}\pi_j\right) -\frac{1}{2} log(|\Sigma_1|) -\frac{d}{2} log(2\pi) -\frac{1}{2}(x_i-\mu_1)^{T} \Sigma_{1}^{-1}(x_i-\mu_1)\right) + &\\
& \phantom{=} \sum_{i=1}^{n}\sum_{j=2}^{K} W_{ij} \left(log(\pi_j) -\frac{1}{2} log (|\Sigma_j|) -\frac{d}{2} log(2\pi) -\frac{1}{2}(x_i-\mu_j)^{T} \Sigma_{j}^{-1}(x_i-\mu_j)\right) &
\end{align*}

for $j=2,3,\dots,K$.

My question is how do I maximize the last part above with respect to $\mu_{j}$ and $\Sigma_{j}$.

\begin{align*}
& M :\text{Maximise over $\theta$} &\\
& E_z \left[l (\theta|X,Z)| X,\theta \right] &\\
\end{align*}

Summary

So to summarize my question, how can I take
\begin{align}
= \sum_{i=1}^{n}W_{i1} \left(log (1-\sum_{j=2}^{K}\pi_j) -\frac{1}{2} log(|\Sigma_1|) -\frac{d}{2} log(2\pi) -\frac{1}{2}(x_i-\mu_1)^{T} \Sigma_{1}^{-1}(x_i-\mu_1) \right)+
\sum_{i=1}^{n}\sum_{j=2}^{K} W_{ij} \left( log(\pi_j) -\frac{1}{2} log (|\Sigma_j|) -\frac{d}{2} log(2\pi) -\frac{1}{2}(x_i-\mu_j)^{T} \Sigma_{j}^{-1}(x_i-\mu_j)\right)
\end{align}

and maximize it with respect to $\mu_j$ and $\Sigma_j$.

I have found a similar post, but it only covers differentiating with respect to $\Sigma_k$.

Best Answer

TL;DR, we have that

$$\mu^*_k = \frac{\sum_{i=1}^n W_{ik}x_i}{\sum_{i=1}^n W_{ik}}$$ $$\Sigma^*_k = \frac{\sum_{i=1}^{n} W_{ik}(x_i -\mu^*_k)(x_i - \mu^*_k)'}{\sum_{i=1}^n W_{ik}}$$

In particular, this is the same as finding the MLE of a Gaussian random variable, except that we weight by $W_{ik}$ for each $k$.

See below for the derivation, which is fairly similar to the MLE derivation for a multivariate Gaussian.
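In code, the two closed-form updates above (plus the usual $\pi_k$ update, which is not the subject of this question) amount to a weighted version of the familiar mean and covariance estimates. A minimal NumPy sketch, with `W` taken from the E step and `m_step` as my own placeholder name:

```python
import numpy as np

def m_step(X, W):
    """Weighted-MLE updates mu*_k and Sigma*_k for each component k."""
    n, d = X.shape
    K = W.shape[1]
    Nk = W.sum(axis=0)             # sum_i W_ik, one value per component
    pi = Nk / n                    # standard weight update (not derived in this answer)
    # mu*_k = (sum_i W_ik x_i) / (sum_i W_ik): a weighted average of the data.
    mu = (W.T @ X) / Nk[:, None]
    Sigma = np.empty((K, d, d))
    for k in range(K):
        # Sigma*_k = (sum_i W_ik (x_i - mu*_k)(x_i - mu*_k)') / (sum_i W_ik).
        diff = X - mu[k]
        Sigma[k] = (W[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma
```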

It may help to approach the E step a bit differently. In your second equation for the E step, you correctly have that you want to maximize

$$\sum_{i=1}^{n} \sum_{j=1}^{K} P\left(Z_i=j|X,\theta\right) log \left(\pi_j \frac{1}{|\Sigma_j|^{1/2}(2\pi)^{d/2}} \operatorname{exp}\left(-\frac{1}{2}(x_i-\mu_j)^{T}\Sigma_j^{-1}(x_i-\mu_j)\right)\right)$$

but we can more simply write that as

$$\sum_{i=1}^{n} \sum_{j=1}^{K} P\left(Z_i=j|X,\theta\right)\left(log(\pi_j) + log\left(\mathcal{N}(x_i;\mu_j,\Sigma_j)\right)\right)$$

where $\mathcal{N}$ denotes the Gaussian density. Following your notation, let $W_{ij} = P\left(Z_i=j|X,\theta\right)$. As pointed out in the comments, we want to take derivatives with respect to $\mu_k$ and $\Sigma_k$ for each $k=1,\dots,K$, set them to $0$, and solve to find the maximum. The first step is to note that, for a given $k$, every term in the summation with $j\neq k$ has zero derivative with respect to $\mu_k$ and $\Sigma_k$. So maximizing the above is the same as maximizing

$$\sum_{i=1}^{n} W_{ik}\left(log(\pi_k) + log\left(\mathcal{N}(x_i;\mu_k,\Sigma_k)\right)\right)$$

A key point of the EM algorithm is precisely that $W_{ik}$ is estimated in the E step, so we can treat it as a constant here. And while we're at it, since $$W_{ik}\left(log(\pi_k) + log\left(\mathcal{N}(x_i;\mu_k,\Sigma_k)\right)\right) = W_{ik}log(\pi_k) + W_{ik}log\left(\mathcal{N}(x_i;\mu_k,\Sigma_k)\right)$$

for any $i$, we can also drop the first term, since its derivative with respect to either parameter is zero. So maximizing the E-step objective over the $k$-th parameters is the same as maximizing

$$\sum_{i=1}^{n} W_{ik} log\left(\mathcal{N}(x_i;\mu_k,\Sigma_k)\right)$$

Suppose that $\Sigma_k \in \mathbb{R}^{d\times d}$. Then we know that the Gaussian PDF is $$\mathcal{N}(x_i;\mu_k,\Sigma_k) = \frac{1}{(2\pi)^{d/2}\det(\Sigma_k)^{1/2}} \exp\left(-\frac{1}{2}(x_i-\mu_k)'\Sigma_k^{-1}(x_i-\mu_k)\right)$$

and taking the log and using the usual properties of log (in particular, $log(xz/y) = log(x) + log(z) - log(y)$ and $log(\exp(x)) = x$), we have

$$log\left(\mathcal{N}(x_i;\mu_k,\Sigma_k)\right) = -\frac{d}{2}log(2\pi) - \frac{1}{2}log(\det(\Sigma_k)) - \frac{1}{2}(x_i-\mu_k)'\Sigma_k^{-1}(x_i-\mu_k)$$
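If you want to sanity-check that expansion numerically, the snippet below (my own check, not part of the original answer) compares it against scipy's logpdf for a random positive definite $\Sigma_k$:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 3
x, mu = rng.normal(size=d), rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)              # random positive definite covariance

# -(d/2) log(2 pi) - (1/2) log det(Sigma) - (1/2) (x - mu)' Sigma^{-1} (x - mu)
diff = x - mu
by_hand = (-0.5 * d * np.log(2 * np.pi)
           - 0.5 * np.log(np.linalg.det(Sigma))
           - 0.5 * diff @ np.linalg.solve(Sigma, diff))

print(np.allclose(by_hand, multivariate_normal.logpdf(x, mean=mu, cov=Sigma)))  # True
```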

and again, since we are taking derivatives, all the terms that do not involve $\mu_k$ or $\Sigma_k$ drop out, so maximizing

$$\sum_{i=1}^{n} W_{ik} log\left(\mathcal{N}(x_i;\mu_k,\Sigma_k)\right)$$

is the same as maximizing

$$\sum_{i=1}^{n} W_{ik}\left(-\frac{1}{2}log(\det(\Sigma_k)) - \frac{1}{2}(x_i-\mu_k)'\Sigma_k^{-1}(x_i-\mu_k)\right)$$

which simplifies to

$$-\frac{1}{2}\sum_{i=1}^{n} W_{ik}log(\det(\Sigma_k)) - \frac{1}{2}\sum_{i=1}^{n} W_{ik}(x_i-\mu_k)'\Sigma_k^{-1}(x_i-\mu_k)$$

Okay, we are finally ready to take derivatives, but we will need some vector and matrix derivative identities, so let's draw from the lovely Matrix Cookbook. From it, we know that $\frac{\partial x'Ax}{\partial x} = 2Ax$ if $A$ is symmetric and does not depend on $x$. Since $\Sigma_k^{-1}$ is the inverse of a covariance matrix, it is symmetric. So taking the derivative with respect to $\mu_k$, the first part disappears, and for the second part we apply the chain rule: differentiate with respect to $(x_i-\mu_k)$ using the rule above, then multiply by the derivative of $(x_i-\mu_k)$ with respect to $\mu_k$, which gives

$$\frac{\partial}{\partial \mu_k}\left(-\frac{1}{2}\sum_{i=1}^{n} W_{ik}(x_i-\mu_k)'\Sigma_k^{-1}(x_i-\mu_k)\right) = \sum_{i=1}^n W_{ik}\Sigma_k^{-1}(x_i - \mu_k) = 0 $$

which implies that

$$\sum_{i=1}^n W_{ik}\Sigma_k^{-1}\mu_k = \sum_{i=1}^n W_{ik}\Sigma_k^{-1}x_i \implies \mu_k\sum_{i=1}^n W_{ik} = \sum_{i=1}^n W_{ik}x_i$$

and so $\mu_k = \frac{\sum_{i=1}^n W_{ik}x_i}{\sum_{i=1}^n W_{ik}}$. Yay!
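As a quick numerical sanity check of that gradient (again my own addition, not part of the original answer), we can compare the analytic expression $\sum_i W_{ik}\Sigma_k^{-1}(x_i-\mu_k)$ against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
w = rng.uniform(size=n)                      # stands in for the column W[:, k]
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)
Sinv = np.linalg.inv(Sigma)
mu = rng.normal(size=d)

def f(m):
    """-(1/2) sum_i w_i (x_i - m)' Sigma^{-1} (x_i - m), the term being maximised."""
    diff = X - m
    return -0.5 * np.sum(w * np.einsum('ij,jk,ik->i', diff, Sinv, diff))

# Analytic gradient: sum_i w_i Sigma^{-1} (x_i - mu).
grad = Sinv @ ((w[:, None] * (X - mu)).sum(axis=0))

# Central finite differences, one coordinate at a time.
eps = 1e-6
fd = np.array([(f(mu + eps * e) - f(mu - eps * e)) / (2 * eps) for e in np.eye(d)])

print(np.allclose(grad, fd, atol=1e-4))      # True
```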

Now let's do $\Sigma_k$. This one is trickier, but the key facts you need to know are that $\frac{\partial{x'Ax}}{\partial A} = xx'$, and that $\frac{\partial log(\det(A))}{\partial A} = A^{-T}$. Again check out the Matrix Cookbook to see why. We will also use the fact that $$-\frac{1}{2}\sum_{i=1}^{n} W_{ik}log(\det(\Sigma_k)) = \frac{1}{2}\sum_{i=1}^{n} W_{ik}log(\det(\Sigma_k^{-1}))$$

which follows from pushing the $-1$ into the log and using the fact that $\det(A^{-1}) = \det(A)^{-1}$. Then we can re-write

$$-\frac{1}{2}\sum_{i=1}^{n} W_{ik}log(\det(\Sigma_k)) - \frac{1}{2}\sum_{i=1}^{n} W_{ik}(x_i-\mu_k)'\Sigma_k^{-1}(x_i-\mu_k) = \frac{1}{2}\sum_{i=1}^{n} W_{ik}log(\det(\Sigma_k^{-1})) - \frac{1}{2}\sum_{i=1}^{n} W_{ik}(x_i-\mu_k)'\Sigma_k^{-1}(x_i-\mu_k)$$

Taking derivative with respect to $\Sigma_k^{-1}$, we have

$$\frac{\partial \frac{1}{2}\sum_{i=1}^{n} W_{ik}log(\det(\Sigma_k^{-1})) - \frac{1}{2}\sum_{i=1}^{n} W_{ik}(x_i-\mu_k)'\Sigma_k^{-1}(x_i-\mu_k)}{\partial \Sigma_k^{-1}} = \frac{1}{2}\sum_{i=1}^n W_{ik}\Sigma_k - \frac{1}{2}\sum_{i=1}^{n} W_{ik}(x_i -\mu_k)(x_i - \mu_k)'$$

And setting this to zero and solving for $\Sigma_k$ gives us that

$$0 = \sum_{i=1}^n W_{ik}\Sigma_k -\sum_{i=1}^{n} W_{ik}(x_i -\mu_k)(x_i - \mu_k)'$$

which simplifies to

$$\Sigma_k = \frac{\sum_{i=1}^{n} W_{ik}(x_i -\mu_k)(x_i - \mu_k)'}{\sum_{i=1}^n W_{ik}}$$

using the previously maximized $\mu_k$ here, and we are done!
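To convince yourself numerically that this pair of updates really is the maximiser, here is a small check of my own (with `w` standing in for the column $W_{\cdot k}$): the weighted log-likelihood at the closed-form solution is not beaten by nearby perturbations.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 2
X = rng.normal(size=(n, d))
w = rng.uniform(size=n)                      # stands in for W[:, k]

# Closed-form updates derived above.
mu_star = (w @ X) / w.sum()
diff = X - mu_star
Sigma_star = (w[:, None] * diff).T @ diff / w.sum()

def objective(mu, Sigma):
    """sum_i w_i log N(x_i; mu, Sigma), the quantity being maximised."""
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return np.sum(w * (-0.5 * d * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad))

best = objective(mu_star, Sigma_star)
perturbed = [objective(mu_star + 0.1 * rng.normal(size=d),
                       Sigma_star + 0.01 * np.eye(d)) for _ in range(5)]
print(best >= max(perturbed))                # True
```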
