Solved – How to calculate Kullback-Leibler divergence/distance

kullback-leibler

I have three data sets X, Y and Z. Each data set defines the frequency of an event occurring. For example:

Data Set X: E1:4, E2:0, E3:10, E4:5, E5:0, E6:0 and so on..
Data Set Y: E1:2, E2:3, E3:7, E4:6, E5:0, E6:0 and so on..
Data Set Z: E1:0, E2:4, E3:8, E4:4, E5:1, E6:0 and so on..

I have to find the KL divergence between X and Y, and between X and Z. As you can see, for some events one data set has a zero count while another has a non-zero count. For some events, all three data sets are 0.

I would appreciate it if someone could help me find the KL divergence for this. I am not much of a statistician, so I don't have much idea. The tutorials I was looking at online were a bit too complex for my understanding.

Best Answer

To answer your question, we should recall the definition of KL divergence:

$$D_{KL}(Y||X) = \sum_{i=1}^N \ln \left( \frac{Y_i}{X_i} \right) Y_i$$

First, you have to convert your frequency counts into probability distributions. To do this, normalize each data set so that it sums to one:

$X_i := \frac{X_i}{\sum_{j=1}^N X_j}$; $Y_i := \frac{Y_i}{\sum_{j=1}^N Y_j}$; $Z_i := \frac{Z_i}{\sum_{j=1}^N Z_j}$
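As a minimal sketch of this normalization step in Python: the function name `normalize` is an illustrative choice, and the lists contain only the six events shown in the question (the "and so on" part is unknown, so this is an assumption about the data).

```python
def normalize(counts):
    """Convert raw event counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize: all counts are zero")
    return [c / total for c in counts]


# Assumption: only events E1..E6 from the question are used.
x_counts = [4, 0, 10, 5, 0, 0]
y_counts = [2, 3, 7, 6, 0, 0]
z_counts = [0, 4, 8, 4, 1, 0]

x = normalize(x_counts)
y = normalize(y_counts)
z = normalize(z_counts)

# Each normalized list now sums to one (up to floating-point rounding).
print(sum(x), sum(y), sum(z))
```

Zero counts stay zero after normalization, which matters for the assumption discussed next.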

Then, for discrete distributions, evaluating $D_{KL}(Y||X)$ relies on one very important assumption that is often violated:

$X_i = 0$ should imply $Y_i = 0$.

When both $X_i$ and $Y_i$ are zero, the term $\ln \left( Y_i / X_i \right) Y_i$ is taken to be zero (its limiting value).

For your data, this means that $D_{KL}(X||Y)$ is finite, but $D_{KL}(Y||X)$, for example, is not, because of the second entry ($X_2 = 0$ while $Y_2 = 3$).
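To make the definition and the zero-handling rule concrete, here is a hedged Python sketch. The function names `normalize` and `kl_divergence` are illustrative, the result is in nats (natural logarithm), and the data is limited to the six events listed in the question, which is an assumption since the full data sets are not shown.

```python
import math


def normalize(counts):
    """Convert raw event counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions of equal length.

    A term with p_i == 0 contributes 0 (the limit convention).
    If p_i > 0 while q_i == 0, the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # 0 * ln(0 / q_i) is taken as 0
        if q_i == 0:
            return math.inf  # support condition violated
        total += p_i * math.log(p_i / q_i)
    return total


# Assumption: only events E1..E6 from the question are used.
x = normalize([4, 0, 10, 5, 0, 0])
y = normalize([2, 3, 7, 6, 0, 0])
z = normalize([0, 4, 8, 4, 1, 0])

kl_xy = kl_divergence(x, y)  # finite: Y is zero only where X is zero
kl_yx = kl_divergence(y, x)  # infinite: X_2 = 0 but Y_2 > 0
kl_xz = kl_divergence(x, z)  # infinite: Z_1 = 0 but X_1 > 0
kl_zx = kl_divergence(z, x)  # infinite: X_2 = 0 but Z_2 > 0
print(kl_xy, kl_yx, kl_xz, kl_zx)
```

On these six events, neither direction of the divergence between X and Z is finite, so the X versus Z comparison needs one of the remedies described next.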

From a practical point of view, I would advise that you:

either make your events "larger" (merge several events into one), so that you have fewer zeros

or collect more data, so that even rare events are covered by at least one entry.
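The first suggestion can be sketched as follows. The helper `merge_events` and the choice to merge consecutive events are hypothetical; whether a particular grouping is meaningful depends on what the events actually represent. Only the six events from the question are used.

```python
def merge_events(counts, group_size):
    """Sum consecutive events into coarser groups of `group_size` events."""
    return [
        sum(counts[i:i + group_size])
        for i in range(0, len(counts), group_size)
    ]


# Assumption: only events E1..E6 from the question are used.
x_counts = [4, 0, 10, 5, 0, 0]
y_counts = [2, 3, 7, 6, 0, 0]
z_counts = [0, 4, 8, 4, 1, 0]

# Merging pairs (E1+E2, E3+E4, E5+E6) removes most zeros, but not all:
x_pairs = merge_events(x_counts, 2)  # [4, 15, 0]
y_pairs = merge_events(y_counts, 2)  # [5, 13, 0]
z_pairs = merge_events(z_counts, 2)  # [4, 12, 1]

# Merging triples (E1..E3, E4..E6) leaves no zeros in any data set:
x_triples = merge_events(x_counts, 3)  # [14, 5]
y_triples = merge_events(y_counts, 3)  # [12, 6]
z_triples = merge_events(z_counts, 3)  # [12, 5]
```

With pairs, X still has a zero where Z does not, so $D_{KL}(Z||X)$ would remain infinite; with triples, every divergence between the three data sets becomes finite, at the cost of a much coarser comparison.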

If neither suggestion is practical, you will probably need to find another measure for comparing the distributions. For example,

Mutual information, defined as $I(X, Y) = \sum_{i=1}^N \sum_{j=1}^N p(X_i, Y_j) \ln \left( \frac{p(X_i, Y_j)}{p(X_i) p(Y_j)} \right)$, where $p(X_i, Y_j)$ is the joint probability of the two events and $p(X_i)$, $p(Y_j)$ are the corresponding marginal probabilities.
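A minimal Python sketch of this formula is given below. Note that it needs a joint table of counts, i.e. how often event $i$ of one variable occurred together with event $j$ of the other; the question's data does not include such a table, so the two small tables here are made-up illustrations, and `mutual_information` is an illustrative name.

```python
import math


def mutual_information(joint_counts):
    """Mutual information in nats from a 2-D table of joint counts.

    joint_counts[i][j] is the number of times event i of the first
    variable was observed together with event j of the second.
    """
    total = sum(sum(row) for row in joint_counts)
    joint = [[c / total for c in row] for row in joint_counts]
    p_row = [sum(row) for row in joint]        # marginal of the first variable
    p_col = [sum(col) for col in zip(*joint)]  # marginal of the second variable
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_ij in enumerate(row):
            if p_ij > 0:  # zero cells contribute 0 by the limit convention
                mi += p_ij * math.log(p_ij / (p_row[i] * p_col[j]))
    return mi


# Hypothetical joint tables, purely for illustration:
independent = [[1, 1], [1, 1]]  # the two variables are unrelated
dependent = [[1, 0], [0, 1]]    # one variable determines the other

mi_independent = mutual_information(independent)  # 0
mi_dependent = mutual_information(dependent)      # ln 2
print(mi_independent, mi_dependent)
```

Unlike KL divergence, zero cells in the joint table cause no problem here, because a cell can only be non-zero where both marginals are non-zero.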

Hope this helps.