[Math] How Gaussian mixture models work

machine-learning, probability

I am given an example:

Suppose 1000 observations are drawn from $N(0,1)$ and $N(5,2)$ with mixing parameters $\pi_{1}=0.2$ and $\pi_{2}=0.8$, respectively. Suppose we only know $\sigma$ and want to estimate $\mu$ and $\pi$. How does one go about using Gaussian mixture models to estimate these parameters? I know I have to use the EM algorithm but I do not know where to start. I want to use this simple example to get a better understanding of how it works.

Best Answer

The canonical thing to do would be to add latent variables $C_i, i = 1, 2, ..., n$. Letting $X_1, X_2, ..., X_n$, $n = 1000$ be your data, you would set $X_i | C_i = 1 \sim N(0, 1)$ and $X_i | C_i = 0 \sim N(5, 2)$ with $C_i = 1$ with probability $\pi = 0.2$ and $C_i = 0$ with probability $1 - \pi$.
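To make this concrete, here is a minimal sketch of how that data-generating process might be simulated (Python with NumPy; the variable names are my own, and I read the $2$ in $N(5, 2)$ as a variance, so swap in a standard deviation of $2$ if that is what was intended):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth values from the question; names are my own.
# N(5, 2) is read here as variance 2, i.e. standard deviation sqrt(2).
n, pi_true = 1000, 0.2
mu1_true, sigma1 = 0.0, 1.0
mu2_true, sigma2 = 5.0, np.sqrt(2.0)

# Latent labels C_i = 1 with probability pi, then X_i | C_i from the matching normal.
C = rng.binomial(1, pi_true, size=n)
X = np.where(C == 1,
             rng.normal(mu1_true, sigma1, size=n),
             rng.normal(mu2_true, sigma2, size=n))
```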

The likelihood after augmenting becomes $$L = \prod_{i=1}^n [\pi N(x_i | \mu_1, 1)]^{C_i} [(1 - \pi) N(x_i | \mu_2, 2)]^{1 - C_i},$$ and the log-likelihood becomes $$\ell = \sum_{i = 1}^n C_i[\log(\pi) + \log(N(x_i | \mu_1, 1))] + (1 - C_i) [\log(1 - \pi) + \log(N(x_i | \mu_2, 2))].$$
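As a sketch, this complete-data log-likelihood can be written as a function like the one below (continuing the variables from the previous snippet; of course the $C_i$ are not observed in practice, which is exactly why EM is needed):

```python
from scipy.stats import norm

def complete_log_likelihood(X, C, mu1, mu2, pi, sigma1=1.0, sigma2=np.sqrt(2.0)):
    """Augmented log-likelihood: sum_i C_i[log pi + log N(x_i | mu1)]
    + (1 - C_i)[log(1 - pi) + log N(x_i | mu2)]."""
    log_comp1 = np.log(pi) + norm.logpdf(X, loc=mu1, scale=sigma1)
    log_comp2 = np.log(1 - pi) + norm.logpdf(X, loc=mu2, scale=sigma2)
    return np.sum(C * log_comp1 + (1 - C) * log_comp2)
```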

We don't know the values of the $C_i$, so we can't maximize this directly; EM works by replacing this maximization problem with maximizing an expected log-likelihood. To apply EM, we first initialize $\mu_1, \mu_2$, and $\pi$ to $\mu_1^{(0)}, \mu_2^{(0)}, \pi^{(0)}$. Next, we take the expectation of $\ell$ with respect to the conditional distribution of $[C_1, ..., C_n | X_1, ..., X_n]$ evaluated at the values of the parameters we just initialized. With a little effort it can be shown that $E[C_i | X_1, ..., X_n]$ under this distribution is $$\rho_i^{(0)} = \frac{\pi^{(0)} N(x_i | \mu_1^{(0)}, 1)}{\pi^{(0)} N(x_i |\mu_1^{(0)}, 1) + (1 - \pi^{(0)}) N(x_i | \mu_2^{(0)}, 2)}.$$ Plugging this into $E[\ell | X_1, X_2, ..., X_n]$ gives $$E[\ell | X_1, X_2, ..., X_n] = \sum_{i = 1} ^ n \rho_i^{(0)}[\log(\pi) + \log(N(x_i | \mu_1, 1))] + (1 - \rho_i^{(0)}) [\log(1 - \pi) + \log(N(x_i | \mu_2, 2))].$$ Now we take the gradient of this expression in $\mu_1, \mu_2, \pi$, set it equal to $0$, and solve for $\mu_1, \mu_2, \pi$ in terms of the $x_i$ and $\rho_i^{(0)}$. The solutions to this equation furnish $\mu_1^{(1)}$, $\mu_2^{(1)}$, and $\pi^{(1)}$, and we start back over with these new values. Iterate until convergence.
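A sketch of computing the $\rho_i$ at the current parameter values, again continuing the snippets above (the function name and default arguments are mine):

```python
def e_step(X, mu1, mu2, pi, sigma1=1.0, sigma2=np.sqrt(2.0)):
    """Responsibilities rho_i = P(C_i = 1 | x_i) at the current parameter values."""
    w1 = pi * norm.pdf(X, loc=mu1, scale=sigma1)
    w2 = (1 - pi) * norm.pdf(X, loc=mu2, scale=sigma2)
    return w1 / (w1 + w2)
```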

In the language of the EM algorithm, the steps above can be broken into an E-step and an M-step. In the E-step of the $(k+1)$st iteration, we calculate $\rho_i^{(k)}$ for $i = 1, 2, ..., n$ in terms of $\mu_1^{(k)}$, $\mu_2^{(k)}$, and $\pi^{(k)}$. In the M-step we set the derivative of the expected log-likelihood to $0$ and solve, which turns out to give $$\pi^{(k+1)} = \frac{\sum \rho_i^{(k)}}{n},$$ $$\mu_1 ^{(k+1)} = \frac{\sum_{i = 1}^n \rho_i^{(k)} x_i}{\sum_{i = 1} ^ n \rho_i^{(k)}},$$ $$\mu_2 ^{(k+1)} = \frac{\sum_{i = 1}^n (1 - \rho_i^{(k)})x_i}{\sum_{i = 1} ^ n (1 - \rho_i^{(k)})}.$$ Intuitively, $\rho_i$ is the conditional probability that $C_i = 1$ given the data; occasionally you'll hear $\rho_i$ referred to as the responsibility of a mixture component for observation $i$. So we average the responsibilities to estimate $\pi$, take a weighted average of the $X_i$ according to their probability of having $C_i = 1$ to get $\mu_1$, and do the same with weights $1 - \rho_i^{(k)}$ to get $\mu_2$.
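Putting it together, a rough sketch of the full iteration, building on the snippets above (the starting values and iteration count are arbitrary choices of mine; a real implementation would check convergence, e.g. by monitoring the change in the parameters or in the observed-data log-likelihood):

```python
def m_step(X, rho):
    """Closed-form M-step updates for pi, mu1, mu2 given the responsibilities."""
    pi = rho.mean()
    mu1 = np.sum(rho * X) / np.sum(rho)
    mu2 = np.sum((1 - rho) * X) / np.sum(1 - rho)
    return mu1, mu2, pi

# Arbitrary starting values; alternate E-step and M-step until things stop changing.
mu1, mu2, pi = 1.0, 4.0, 0.5
for _ in range(100):
    rho = e_step(X, mu1, mu2, pi)
    mu1, mu2, pi = m_step(X, rho)

print(mu1, mu2, pi)  # should land near 0, 5, and 0.2 up to sampling noise
```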
