This answer is somewhat intuitive; I hope it gives you some ideas.
The KL divergence has several mathematical interpretations. Although it is used to compare distributions, it comes from the field of information theory, where it measures how much "information" is lost when coding a source using a distribution other than the real one. In information theory, it can also be defined as a difference of entropies: the cross entropy of $P$ and $Q$ minus the entropy of $P$.
So to discuss the KL divergence, we need to understand the meaning of entropy. Entropy measures the "information" in a source, and generally describes how "surprised" you will be by the outcome of the random variable. For instance, if you have a uniform distribution, you will always be "surprised" because there is a wide range of values it can take, so it has high entropy. However, if the RV is a coin with $p=0.9$, then you will probably not be surprised, because it will succeed 90% of the time, so it has low entropy.
Entropy is defined as $H(X)=-\sum_x P(x)\log P(x)=E[-\log P(X)]$, which is the expectation of $I(X)$, the information of the source. Why the log? One reason is the logarithm property $\log(xy)=\log(x)+\log(y)$: a source composed of independent sources (with joint probability $p(x,y)=p_1(x)p_2(y)$) has information equal to the sum of their individual information. This additivity can only be achieved with a logarithm.
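Both points can be checked numerically. A minimal sketch in Python with NumPy (the helper name `entropy` is just for this illustration; base-2 logs give results in bits):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) log2 p(x), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

uniform = [0.25, 0.25, 0.25, 0.25]  # always "surprising": high entropy
biased = [0.9, 0.1]                 # coin with p = 0.9: low entropy

print(entropy(uniform))  # 2.0 bits
print(entropy(biased))   # ~0.469 bits

# Additivity: the entropy of two independent sources combined
# (joint p(x, y) = p1(x) * p2(y)) is the sum of their entropies.
joint = np.outer(biased, uniform).ravel()
print(np.isclose(entropy(joint), entropy(biased) + entropy(uniform)))  # True
```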
You will need some conditions to claim the equivalence between minimizing cross entropy and minimizing KL divergence. I will put your question under the context of classification problems using cross entropy as loss functions.
Let us first recall that entropy is used to measure the uncertainty of a system, which is defined as
\begin{equation}
S(v)=-\sum_ip(v_i)\log p(v_i)\label{eq:entropy},
\end{equation}
where $p(v_i)$ is the probability of state $v_i$ of the system. From an information theory point of view, $S(v)$ is the amount of information needed to remove the uncertainty.
For instance, the event $I$: "I will die within 200 years" is almost certain (we may solve the aging problem, hence the word "almost"), so it has low uncertainty; it requires only the single piece of information "the aging problem cannot be solved" to become certain. However, the event $II$: "I will die within 50 years" is more uncertain than event $I$, so it needs more information to remove its uncertainty. Here, entropy quantifies the uncertainty of the distribution "When will I die?", which can be regarded as the expectation of the uncertainties of individual events like $I$ and $II$.
Now look at the definition of the KL divergence between distributions $A$ and $B$:
\begin{equation}
D_{KL}(A\parallel B) = \sum_i\left[p_A(v_i)\log p_A(v_i) - p_A(v_i)\log p_B(v_i)\right]\label{eq:kld},
\end{equation}
where the first term on the right hand side is the negative entropy of distribution $A$, and the second term is the expectation of $\log p_B(v_i)$ under distribution $A$. $D_{KL}$ describes how different $B$ is from $A$, from the perspective of $A$. It is worth noting that $A$ usually stands for the data, i.e. the measured distribution, and $B$ is the theoretical or hypothetical distribution. That is, you always start from what you observed.
To relate cross entropy to entropy and KL divergence, we formalize the cross entropy in terms of distributions $A$ and $B$ as
\begin{equation}
H(A, B) = -\sum_ip_A(v_i)\log p_B(v_i)\label{eq:crossentropy}.
\end{equation}
From the definitions, we can easily see
\begin{equation}
H(A, B) = D_{KL}(A\parallel B)+S_A\label{eq:entropyrelation}.
\end{equation}
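This identity can be verified numerically. A small sketch (the distributions $A$ and $B$ below are arbitrary illustrative values, assumed strictly positive so all logs are defined):

```python
import numpy as np

def entropy(a):
    """S_A = -sum_i p_A(v_i) log p_A(v_i)."""
    a = np.asarray(a, dtype=float)
    return -np.sum(a * np.log(a))

def cross_entropy(a, b):
    """H(A, B) = -sum_i p_A(v_i) log p_B(v_i)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return -np.sum(a * np.log(b))

def kl_divergence(a, b):
    """D_KL(A || B) = sum_i p_A(v_i) log(p_A(v_i) / p_B(v_i))."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(a * np.log(a / b))

A = [0.1, 0.6, 0.3]  # "measured" distribution
B = [0.2, 0.5, 0.3]  # "hypothetical" distribution

# H(A, B) = D_KL(A || B) + S_A
print(np.isclose(cross_entropy(A, B), kl_divergence(A, B) + entropy(A)))  # True
```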
If $S_A$ is a constant, then minimizing $H(A, B)$ is equivalent to minimizing $D_{KL}(A\parallel B)$.
A further question naturally follows: how can the entropy be a constant? In a machine learning task, we start with a dataset (denoted $P(\mathcal D)$) which represents the problem to be solved, and the purpose of learning is to make the model's estimated distribution (denoted $P(model)$) as close as possible to the true distribution of the problem (denoted $P(truth)$).
$P(truth)$ is unknown and represented by $P(\mathcal D)$. Therefore in an ideal world, we expect
\begin{equation}
P(model)\approx P(\mathcal D) \approx P(truth)
\end{equation}
and we minimize $D_{KL}(P(\mathcal D)\parallel P(model))$. Luckily, in practice $\mathcal D$ is given, which means its entropy $S(\mathcal D)$ is fixed as a constant.
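In the common case of classification with one-hot labels, the empirical distribution for each example has zero entropy, so the cross-entropy loss equals the KL divergence exactly. A small sketch (the predicted probabilities are illustrative, e.g. a softmax output):

```python
import numpy as np

# One-hot label: the empirical distribution P(D) for a single example.
label = np.array([0.0, 1.0, 0.0])
# Model's predicted class probabilities — illustrative values.
pred = np.array([0.2, 0.7, 0.1])

# Cross-entropy loss H(A, B); only the true class contributes.
ce = -np.sum(label * np.log(pred))

# KL divergence restricted to the support of the label
# (the entropy of a one-hot distribution is 0, since 0 * log 0 = 0).
mask = label > 0
kl = np.sum(label[mask] * np.log(label[mask] / pred[mask]))

print(np.isclose(ce, kl))  # True: minimizing one minimizes the other
```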
Best Answer
I ended up coding KL divergences and derivatives myself in Julia. I've released it as part of an existing open source project. Future readers may find the code at this page of the Celeste.jl project.