[Math] GMM with full and diagonal covariances

matrices, probability distributions, statistics, stochastic-analysis, stochastic-calculus

I have a Gaussian mixture model (GMM): a distribution whose probability density function is a weighted sum of Gaussian probability density functions:
\begin{equation}
p(X)=\sum_{i=1}^k \omega_i\mathcal{N}(X,\mu_i,\Sigma_i)=\sum_{i=1}^k \omega_ip_i(X),
\end{equation}

where $k$ is the number of components, $\mathcal{N}(X,\mu_i,\Sigma_i), i=1,…,k$ are Gaussian densities
with expectations (vectors) $\mu_i,i=1,…,k$ and covariance matrices $\Sigma_i,i=1,…,k$,

$\omega_i,i=1,…,k$ are weights: $\sum_{i=1}^k \omega_i=1.$
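To make the definition concrete, here is a minimal sketch that evaluates $p(X)$ for a two-dimensional mixture with full covariances. The function names `gauss_pdf_2d` and `gmm_pdf_2d` and the example parameters are illustrative assumptions, not part of the question; the code is restricted to 2-D so that the inverse and determinant of $\Sigma_i$ can be written in closed form.

```python
import math

def gauss_pdf_2d(x, mean, cov):
    """Density N(x; mean, cov) of a 2-D Gaussian with a full covariance matrix."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), using the closed-form 2x2 inverse.
    quad = (s22 * dx * dx - (s12 + s21) * dx * dy + s11 * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def gmm_pdf_2d(x, weights, means, covs):
    """Mixture density p(x) = sum_i w_i * N(x; mu_i, Sigma_i)."""
    return sum(w * gauss_pdf_2d(x, m, c) for w, m, c in zip(weights, means, covs))

# Hypothetical example parameters: two components, weights summing to 1.
weights = [0.3, 0.7]
means = [(0.0, 0.0), (3.0, 1.0)]
covs = [((1.0, 0.5), (0.5, 2.0)), ((1.5, -0.4), (-0.4, 0.8))]
print(gmm_pdf_2d((1.0, 0.5), weights, means, covs))
```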

The covariance matrices $\Sigma_i,i=1,…,k$, are full: they have non-zero off-diagonal elements (correlations).
How can I approximate this GMM by a GMM whose components have diagonal covariances? I understand that the weighted sum will need more components, but each of them will have a diagonal covariance.
Page 2 of the following paper states, without proof, that this is possible:

https://www.ll.mit.edu/mission/cybersec/publications/publication-files/full_papers/0802_Reynolds_Biometrics-GMM.pdf

"It is also important to note that because the component Gaussian are
acting together to model the overall feature density, full covariance
matrices are not necessary even if the features are not statistically
independent. The linear combination of diagonal covariance basis Gaussians
is capable of modeling the correlations between feature vector elements.
The effect of using a set of M full covariance matrix Gaussians can be
equally obtained by using a larger set of diagonal covariance Gaussians. "

But how can this be done, and how do the computational costs of the two cases compare? Is it faster to compute with more components if they are all diagonal?
Thank you.

Best Answer

I don't know if this helps you, but the same claim is made in

http://download.springer.com/static/pdf/237/art%253A10.1155%252FS1110865704310024.pdf?originUrl=http%3A%2F%2Fasp.eurasipjournals.springeropen.com%2Farticle%2F10.1155%2FS1110865704310024&token2=exp=1480612221~acl=%2Fstatic%2Fpdf%2F237%2Fart%25253A10.1155%25252FS1110865704310024.pdf*~hmac=37cc80cf0cee60b0efd6e74cc177540e8b4d1bc30c6e29a5771edc5a3e092ff9 (p. 435)

the exact passage is:

While the general model form supports full covariance matrices, that is, a covariance matrix with all its elements, typically only diagonal covariance matrices are used. This is done for three reasons. First, the density modeling of an Mth-order full covariance GMM can equally well be achieved using a larger-order diagonal covariance GMM.

with the explanation being:

GMMs with M > 1 using diagonal covariance matrices can model distributions of feature vectors with correlated elements. Only in the degenerate case of M = 1 is the use of a diagonal covariance matrix incorrect for feature vectors with correlated elements.
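Neither passage gives a construction, so the following is only one possible sketch, not the method of the cited papers. It assumes a single zero-mean 2-D component and uses the split $\Sigma = D + aa^T$ with $D$ diagonal: then $\mathcal{N}(x,0,\Sigma)=\int \mathcal{N}(t,0,1)\,\mathcal{N}(x,at,D)\,dt$, and replacing the integral by a sum over a grid of $t$ values gives a finite GMM whose components all share the diagonal covariance $D$. The names `diagonal_mixture_for`, `step` and `t_max` are hypothetical.

```python
import math

def diag_gauss_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance diag(var)."""
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p -= 0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)

def full_gauss_pdf_2d(x, cov):
    """Density of a zero-mean 2-D Gaussian with full covariance cov."""
    (s11, s12), (_, s22) = cov
    det = s11 * s22 - s12 * s12
    quad = (s22 * x[0] ** 2 - 2.0 * s12 * x[0] * x[1] + s11 * x[1] ** 2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def diagonal_mixture_for(cov, step=0.25, t_max=6.0):
    """Approximate N(0, cov) by a mixture of diagonal-covariance Gaussians."""
    (s11, s12), (_, s22) = cov
    # Split cov = D + a a^T with a[0] * a[1] = s12; D stays positive
    # because cov is positive definite (|s12| < sqrt(s11 * s22)).
    r = math.sqrt(abs(s12))
    a = (r * (s11 / s22) ** 0.25, math.copysign(r, s12) * (s22 / s11) ** 0.25)
    var = (s11 - a[0] ** 2, s22 - a[1] ** 2)
    # Discretize the latent scalar t ~ N(0, 1) on a uniform grid.
    n = int(round(t_max / step))
    ts = [j * step for j in range(-n, n + 1)]
    raw = [math.exp(-0.5 * t * t) for t in ts]
    total = sum(raw)
    weights = [w / total for w in raw]
    means = [(a[0] * t, a[1] * t) for t in ts]
    return weights, means, var

def mixture_pdf(x, weights, means, var):
    """Density of the diagonal mixture; all components share the variances var."""
    return sum(w * diag_gauss_pdf(x, m, var) for w, m in zip(weights, means))

cov = ((2.0, 1.2), (1.2, 1.0))  # example full covariance (an assumption)
weights, means, var = diagonal_mixture_for(cov)
x = (0.7, -0.3)
print(len(weights), mixture_pdf(x, weights, means, var), full_gauss_pdf_2d(x, cov))
```

On cost: for dimension $d$, evaluating one diagonal component takes $O(d)$ operations, while a full-covariance component takes $O(d^2)$ once its inverse (or Cholesky factor) has been precomputed. Whether the diagonal mixture is faster overall therefore depends on how many extra components the approximation needs; in this sketch one full component is replaced by several dozen diagonal ones.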

Related Solutions

[Math] Estimating a gaussian distribution from a GMM

For convenience of notation, I write $\pi_i=\pi(c_i)$ for the weight of component $i$.

For $\mu$, you should take the weighted average of the component means:

$$\mu = \sum_{i=1}^{C}\pi_i\mu_i$$

For the covariance matrix:

$$\Sigma=\left(\sum_{i=1}^C \pi_i (\Sigma_i+\mu_i\mu_i^T)\right)-\mu\mu^T$$

For intuition about why this works, think about the mean of all points drawn from the GMM: where do you expect it to be?

A rigorous proof follows.

For $\mu$, you should calculate $E_{x\sim GMM}[x]$:

$$E_{x\sim GMM}[x]=\int_{x\in \mathcal{X}} x\sum_{i=1}^C \pi_i \frac{1}{|2\pi \Sigma_i|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)}dx$$

$$=\sum_{i=1}^C \pi_i \int_{x\in \mathcal{X}} x \frac{1}{|2\pi \Sigma_i|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)}dx$$

$$=\sum_{i=1}^C \pi_i \mu_i$$

For the covariance, you should calculate: $$E_{x\sim GMM}[(x-\mu)(x-\mu)^T]=E_{x\sim GMM}[xx^T]-\mu\mu^T$$

Let's focus on $E_{x\sim GMM}[xx^T]$:

$$E_{x\sim GMM}[xx^T]=\int_{x\in \mathcal{X}} xx^T\sum_{i=1}^C \pi_i \frac{1}{|2\pi \Sigma_i|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)}dx$$

$$= \sum_{i=1}^C \pi_i \int_{x\in \mathcal{X}}xx^T\frac{1}{|2\pi \Sigma_i|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)}dx$$

$$= \sum_{i=1}^C \pi_i (\Sigma_i+\mu_i\mu_i^T)$$
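The last step relies on the second moment of a single Gaussian component. As a short worked step, using only the standard definition of covariance and writing $E_i$ for the expectation under component $i$ (a notation introduced here for illustration): for $x\sim\mathcal{N}(\mu_i,\Sigma_i)$,

$$\Sigma_i = E_i[(x-\mu_i)(x-\mu_i)^T] = E_i[xx^T]-\mu_i\mu_i^T \quad\Longrightarrow\quad E_i[xx^T]=\Sigma_i+\mu_i\mu_i^T$$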

Therefore the covariance of the GMM is:

$$\Sigma=\left(\sum_{i=1}^C \pi_i (\Sigma_i+\mu_i\mu_i^T)\right)-\mu\mu^T$$

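The following MATLAB code verifies the theoretical results for a GMM with two Gaussians:

    % Sample sizes; the mixture weights are the sample proportions.
    n1 = 1000000;
    n2 = 3000000;
    p1 = n1/(n1+n2);
    p2 = n2/(n1+n2);

    % Component means and random symmetric positive semi-definite covariances.
    mu1 = [0,0,0];
    mu2 = [10,10,10];
    A = rand(3);
    S1 = A'*A
    A = rand(3);
    S2 = A'*A

    % Draw samples from each component
    % (mvnrnd needs the Statistics and Machine Learning Toolbox).
    r1 = mvnrnd(mu1,S1,n1);
    r2 = mvnrnd(mu2,S2,n2);

    % Sanity check: each sample covariance should be close to the true one.
    S1
    S1_hat = cov(r1)

    S2
    S2_hat = cov(r2)

    % Empirical mean of the pooled sample (mu) vs. the formula (mu_hat).
    r = [r1;r2];
    mu = mean(r)
    mu_hat = p1*mu1+p2*mu2

    % Empirical covariance (S) vs. the formula (S_hat).
    % The means are row vectors, so mu1'*mu1 is the outer product.
    S = cov(r)
    S_hat = p1*(S1+mu1'*mu1)+p2*(S2+mu2'*mu2)-mu_hat'*mu_hat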

Here is the result of running the code:

    mu =

        7.5009    7.5007    7.5000

    mu_hat =

        7.5000    7.5000    7.5000

    S =

       20.5464   20.4126   19.7789
       20.4126   20.4026   19.7273
       19.7789   19.7273   19.8504

    S_hat =

       20.5485   20.4149   19.7801
       20.4149   20.4051   19.7284
       19.7801   19.7284   19.8508