I have three data sets X, Y and Z. Each data set defines the frequency of an event occurring. For example:
Data Set X: E1:4, E2:0, E3:10, E4:5, E5:0, E6:0 and so on..
Data Set Y: E1:2, E2:3, E3:7, E4:6, E5:0, E6:0 and so on..
Data Set Z: E1:0, E2:4, E3:8, E4:4, E5:1, E6:0 and so on..
I have to find the KL divergence between X and Y, and between X and Z. As you can see, for some events one data set has a zero count while another has a non-zero count. For some events, all three data sets are 0.
I would appreciate it if someone could help me find the KL divergence for this. I am not much of a statistician, so I don't have much idea. The tutorials I was looking at online were a bit too complex for my understanding.
Best Answer
To answer your question, we should recall the definition of KL divergence:
$$D_{KL}(Y||X) = \sum_{i=1}^N \ln \left( \frac{Y_i}{X_i} \right) Y_i$$
First of all, you have to turn your frequencies into probability distributions. To do this, normalize each data set so that it sums to one:
$X_i := \frac{X_i}{\sum_{j=1}^N X_j}$; $Y_i := \frac{Y_i}{\sum_{j=1}^N Y_j}$; $Z_i := \frac{Z_i}{\sum_{j=1}^N Z_j}$
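The normalization step can be sketched in a few lines of Python. This is only an illustration: the `normalize` helper and the `counts` dictionary are names made up here, and only the six events listed in the question are used (the "and so on" tail is left out).

```python
# Counts for the six events shown in the question (the remaining events are omitted).
counts = {
    "X": [4, 0, 10, 5, 0, 0],
    "Y": [2, 3, 7, 6, 0, 0],
    "Z": [0, 4, 8, 4, 1, 0],
}


def normalize(freqs):
    """Divide each count by the total so that the values sum to one."""
    total = sum(freqs)
    if total == 0:
        raise ValueError("cannot normalize: all counts are zero")
    return [f / total for f in freqs]


# Hypothetical container for the resulting probability distributions.
probs = {name: normalize(freqs) for name, freqs in counts.items()}
```

Zero counts stay zero after normalization, so the zero pattern of each data set is unchanged.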
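Then, for discrete distributions, there is one very important assumption that is needed to evaluate the KL divergence and that is often violated: for $D_{KL}(Y||X)$ to be finite, $X_i = 0$ must imply $Y_i = 0$. If $Y_i > 0$ while $X_i = 0$, the corresponding term is infinite.
When both $X_i$ and $Y_i$ are equal to zero, the term $\ln \left( Y_i / X_i \right) Y_i$ is taken to be zero (its limiting value).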
In your data set this means that you can compute $D_{KL}(X||Y)$ but not, for example, $D_{KL}(Y||X)$, because of the second entry ($X_2 = 0$ while $Y_2 = 3$).
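As a rough sketch of how this plays out on the six events listed in the question, the following Python computes the divergence directly from the raw counts. The function name `kl_divergence` is made up for this example; it follows the definition above, skips terms whose weight is zero, and returns infinity when the first distribution puts mass on an event where the second one has none.

```python
import math


def kl_divergence(p_counts, q_counts):
    """D_KL(P||Q) = sum_i P_i * ln(P_i / Q_i), computed from raw counts."""
    p_total, q_total = sum(p_counts), sum(q_counts)
    result = 0.0
    for p_c, q_c in zip(p_counts, q_counts):
        if p_c == 0:
            continue  # a term with P_i = 0 is taken to be 0 (limiting value)
        if q_c == 0:
            return math.inf  # P_i > 0 but Q_i = 0: the divergence is infinite
        p, q = p_c / p_total, q_c / q_total
        result += p * math.log(p / q)
    return result


# Only the six events shown in the question; the remaining events are omitted.
X = [4, 0, 10, 5, 0, 0]
Y = [2, 3, 7, 6, 0, 0]
Z = [0, 4, 8, 4, 1, 0]

print(kl_divergence(X, Y))  # finite
print(kl_divergence(Y, X))  # inf, because X_2 = 0 while Y_2 = 3
print(kl_divergence(X, Z))  # inf, because Z_1 = 0 while X_1 = 4
print(kl_divergence(Z, X))  # inf, because X_2 = 0 while Z_2 = 4
```

On these six events, neither direction between X and Z is finite, so only $D_{KL}(X||Y)$ can be evaluated as is.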
From a practical point of view, what I would advise is:
If you can use neither of the suggestions above, you will probably need to find another metric between the distributions. For example,
Hope this helps.