[Math] Estimating a Gaussian distribution from a GMM

estimation, normal distribution, parameter estimation, statistics

Suppose that we have a Gaussian mixture model (GMM) in $n$-dimensional space:

$$P_1(x) = \sum_{i=1}^{C}\pi(c_i)\,\mathcal{N}(x;\mu_i,\Sigma_i)$$

We want to estimate a single Gaussian distribution from this mixture:

$$P_2(x) = \mathcal{N}(x;\mu,\Sigma)$$

For example, suppose that we have a GMM with two Gaussian components:

*(figure: GMM of two Gaussians)*

and we want to convert it into a single Gaussian distribution:

*(figure: single Gaussian)*

We want the single Gaussian to be as close as possible to the original GMM. In other words, if you draw a very large number of samples from $P_1(x)$, the maximum-likelihood Gaussian fit to those samples should be (close to) $P_2(x)$.

How can I estimate $\mu$ and $\Sigma$ for $P_2(x)$?

IMPORTANT NOTE: The dimension is greater than 2, so we have to work with the full covariance matrix, not a set of independent variances.

Important notice: I don't want to re-generate data from $P_1(x)$ and fit a Gaussian to the samples. Instead, I want to estimate $P_2(x)$ (the Gaussian) directly from the parameters of $P_1(x)$ (the GMM).

Best Answer

For convenience of notation, I use $\pi_i=\pi(c_i)$.

For $\mu$, take the weighted average of the component means:

$$\mu = \sum_{i=1}^{C}\pi_i\mu_i$$

For the covariance matrix:

$$\Sigma=\left(\sum_{i=1}^{C} \pi_i (\Sigma_i+\mu_i\mu_i^T)\right)-\mu\mu^T$$
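If you just want to compute these two quantities, here is a minimal MATLAB sketch (the function name `gmm_moments` and the argument layout are my own choices, not part of the question):

    % Minimal sketch: moment-match a GMM to a single Gaussian.
    %   p  : C-by-1 vector of mixture weights (must sum to 1)
    %   Mu : C-by-n matrix, one component mean per row
    %   S  : n-by-n-by-C array of component covariances
    function [mu, Sigma] = gmm_moments(p, Mu, S)
        C = size(Mu, 1);
        mu = p' * Mu;                % weighted average of the means (1-by-n)
        Sigma = -(mu' * mu);         % start from the -mu*mu^T correction term
        for i = 1:C
            % accumulate pi_i * (Sigma_i + mu_i * mu_i^T)
            Sigma = Sigma + p(i) * (S(:,:,i) + Mu(i,:)' * Mu(i,:));
        end
    end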

For an intuitive sense of why this works, think about the mean of all points drawn from the GMM: where would you expect it to be?

A rigorous proof follows.

For $\mu$, you should calculate: $E_{x\sim GMM}[x]$

$$E_{x\sim GMM}[x]=\int_{x\in \mathcal{X}} x\sum_{i=1}^C \pi_i \frac{1}{|2\pi \Sigma_i|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)}dx$$

$$\Rightarrow=\sum_{i=1}^C \pi_i \int_{x\in \mathcal{X}} x \frac{1}{|2\pi \Sigma_i|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)}dx$$

$$\Rightarrow=\sum_{i=1}^C \pi_i \mu_i$$

where the last step holds because each inner integral is the first moment of the $i$-th component, i.e. $\mu_i$.

For the covariance, you should calculate: $$E_{x\sim GMM}[(x-\mu)(x-\mu)^T]=E_{x\sim GMM}[xx^T]-\mu\mu^T$$
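The second equality is the usual expansion of the covariance, using $E_{x\sim GMM}[x]=\mu$ and the linearity of expectation:

$$E[(x-\mu)(x-\mu)^T]=E[xx^T]-E[x]\mu^T-\mu E[x]^T+\mu\mu^T=E[xx^T]-\mu\mu^T$$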

Let's focus on $E_{x\sim GMM}[xx^T]$:

$$E_{x\sim GMM}[xx^T]=\int_{x\in \mathcal{X}} xx^T\sum_{i=1}^C \pi_i \frac{1}{|2\pi \Sigma_i|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)}dx$$

$$\Rightarrow = \sum_{i=1}^C \pi_i \int_{x\in \mathcal{X}}xx^T\frac{1}{|2\pi \Sigma_i|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)}dx$$

$$\Rightarrow = \sum_{i=1}^C \pi_i (\Sigma_i+\mu_i\mu_i^T)$$
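The last step uses the second moment of a single Gaussian, obtained by rearranging the definition of its covariance:

$$\Sigma_i=E[(x-\mu_i)(x-\mu_i)^T]=E[xx^T]-\mu_i\mu_i^T \quad\Rightarrow\quad E[xx^T]=\Sigma_i+\mu_i\mu_i^T$$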

Therefore the covariance of the GMM is:

$$\Sigma=\left(\sum_{i=1}^C \pi_i (\Sigma_i+\mu_i\mu_i^T)\right)-\mu\mu^T$$

The following MATLAB code verifies the theoretical results for a GMM with two Gaussians:

    % Mixture weights are proportional to the number of samples per component
    n1 = 1000000;
    n2 = 3000000;
    p1 = n1/(n1+n2);
    p2 = n2/(n1+n2);

    % Component means
    mu1 = [0, 0, 0];
    mu2 = [10, 10, 10];

    % Random symmetric positive-definite covariance matrices
    A = rand(3);
    S1 = A'*A
    A = rand(3);
    S2 = A'*A

    % Draw n1 and n2 samples from the two components
    r1 = mvnrnd(mu1, S1, n1);
    r2 = mvnrnd(mu2, S2, n2);

    % Sanity check: sample covariances vs. true component covariances
    S1
    S1_hat = cov(r1)
    S2
    S2_hat = cov(r2)

    % Pool the samples: empirical mean vs. the weighted-average formula
    r = [r1; r2];
    mu = mean(r)
    mu_hat = p1*mu1 + p2*mu2

    % Empirical covariance vs. the closed-form formula
    S = cov(r)
    S_hat = p1*(S1 + mu1'*mu1) + p2*(S2 + mu2'*mu2) - mu_hat'*mu_hat

Here is the result of running the code:

    mu =
        7.5009    7.5007    7.5000

    mu_hat =
        7.5000    7.5000    7.5000

    S =
       20.5464   20.4126   19.7789
       20.4126   20.4026   19.7273
       19.7789   19.7273   19.8504

    S_hat =
       20.5485   20.4149   19.7801
       20.4149   20.4051   19.7284
       19.7801   19.7284   19.8508
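As a further check, the closed-form helper sketched earlier (the hypothetical `gmm_moments`) reproduces `mu_hat` and `S_hat` directly from the GMM parameters, without drawing a single sample:

    % Closed-form moments from the mixture parameters alone
    [mu_cf, S_cf] = gmm_moments([p1; p2], [mu1; mu2], cat(3, S1, S2))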