Solved – How to use Kullback-Leibler divergence if the mean and standard deviation of two Gaussian distributions are provided

gaussian-mixture-distribution, java, kullback-leibler, normal-distribution, spark-mllib

With the Apache Spark MLlib library I am trying to find clusters within a dataset using a Gaussian Mixture Model (number of clusters = 3). It returns three different pairs of mean and standard deviation. I am trying to determine whether there is any overlap between any two of the distributions. To do that, I am trying to compute the distances between the distributions.

Standard code for KL divergence looks like this, and generally takes as arguments two arrays of probabilities corresponding to two different distributions.
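That snippet is not reproduced here, but a minimal sketch of the discrete form it describes (the method name `klDivergence` and the natural-log base are my own choices) would be:

```java
// Discrete KL divergence between two probability arrays p and q.
// Assumes both arrays have the same length and each sums to 1.
static double klDivergence(double[] p, double[] q) {
    double kl = 0.0;
    for (int i = 0; i < p.length; i++) {
        // p[i] == 0 contributes nothing; q[i] == 0 with p[i] > 0 would make
        // the divergence infinite, so such bins are skipped here for simplicity.
        if (p[i] > 0.0 && q[i] > 0.0) {
            kl += p[i] * Math.log(p[i] / q[i]);
        }
    }
    return kl;
}
```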

Now my questions are:
1. How do I change the equation to work on the mean and sigma?
2. How do I conclude from the return value whether the distributions overlap?

Best Answer

You can compute the pairwise KL divergence in closed form as a function of the parameters for two Gaussian distributions $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$. The univariate case:

$KL(p||q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_{1}^{2} + (\mu_1-\mu_2)^2}{2\sigma_{2}^{2}} - \frac{1}{2}$
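In code this needs only the two means and standard deviations, so it works directly on what the mixture model returns. A minimal Java sketch (the method name `klGaussian` is my own):

```java
// Closed-form KL(p||q) for univariate Gaussians
// p = N(mu1, sigma1^2) and q = N(mu2, sigma2^2).
static double klGaussian(double mu1, double sigma1, double mu2, double sigma2) {
    double var1 = sigma1 * sigma1;
    double var2 = sigma2 * sigma2;
    double diff = mu1 - mu2;
    return Math.log(sigma2 / sigma1) + (var1 + diff * diff) / (2.0 * var2) - 0.5;
}
```

Note that KL divergence is not symmetric, so $KL(p||q) \neq KL(q||p)$ in general. For an overlap check, a common heuristic is the symmetrized value $KL(p||q) + KL(q||p)$: it is zero exactly when the two Gaussians are identical and grows as they separate, so small values indicate strongly overlapping components.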

The corresponding multivariate case, for $d$-dimensional Gaussians $p = \mathcal{N}(\mu_1, \Sigma_1)$ and $q = \mathcal{N}(\mu_2, \Sigma_2)$, is:

$KL(p||q) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - d + \text{tr} (\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^T \Sigma_2^{-1}(\mu_2 - \mu_1)\right]$

as derived here and here. Alternatively, you can try visualizing the cluster overlap by plotting the density of the mixture components.
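If you want to evaluate the multivariate expression directly, here is a sketch using Apache Commons Math for the linear algebra (an assumed dependency; Spark MLlib's `MultivariateGaussian` parameters would first have to be converted to plain arrays):

```java
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.ArrayRealVector;
import org.apache.commons.math3.linear.LUDecomposition;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.RealVector;

public class GaussianKL {

    // Closed-form KL(p||q) for multivariate Gaussians
    // p = N(mu1, cov1) and q = N(mu2, cov2).
    static double klMultivariate(double[] mu1, double[][] cov1,
                                 double[] mu2, double[][] cov2) {
        int d = mu1.length;
        RealMatrix sigma1 = new Array2DRowRealMatrix(cov1);
        RealMatrix sigma2 = new Array2DRowRealMatrix(cov2);

        LUDecomposition lu1 = new LUDecomposition(sigma1);
        LUDecomposition lu2 = new LUDecomposition(sigma2);
        RealMatrix sigma2Inv = lu2.getSolver().getInverse();

        // log(|Sigma_2| / |Sigma_1|)
        double logDetRatio = Math.log(lu2.getDeterminant() / lu1.getDeterminant());
        // tr(Sigma_2^{-1} Sigma_1)
        double trace = sigma2Inv.multiply(sigma1).getTrace();
        // (mu_2 - mu_1)^T Sigma_2^{-1} (mu_2 - mu_1)
        RealVector diff = new ArrayRealVector(mu2).subtract(new ArrayRealVector(mu1));
        double quadratic = diff.dotProduct(sigma2Inv.operate(diff));

        return 0.5 * (logDetRatio - d + trace + quadratic);
    }
}
```

As in the univariate case, computing the symmetrized value for each pair of the three fitted components gives a simple pairwise distance table you can inspect for overlap.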