Solved – Kullback-Leibler Divergence for two samples

estimation, intuition, kullback-leibler, numerics

I tried to implement a numerical estimate of the Kullback-Leibler divergence for two samples. To debug the implementation, I draw the samples from two normal distributions $\mathcal N (0,1)$ and $\mathcal N (1,2)$.

For a simple estimate I generated two histograms and tried to numerically approximate the integral. I got stuck handling those parts of the histograms where the bins of one histogram are zero, so that I either end up dividing by zero or taking the logarithm of zero. How do I handle this issue?

A related question came to my mind:
How do I compute the KL divergence exactly between two different uniform distributions? Do I have to restrict the integral to the union of the supports of the two distributions?

Best Answer

The Kullback-Leibler divergence is defined as $$ \DeclareMathOperator{\KL}{KL} \KL(P || Q) = \int_{-\infty}^\infty p(x) \log \frac{p(x)}{q(x)} \; dx $$ so to calculate (estimate) this from empirical data we would need, probably, some estimates of the density functions $p(x), q(x)$. So a natural starting point could be density estimation (and after that, just numerical integration). How good or stable such a method would be, I don't know.
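For concreteness, one rough sketch of that route (just an illustration; the name kl_density_est and the 1e-12 cutoff that avoids log(0) and division by zero are arbitrary choices) could be:

## kernel density estimates on a common grid, then the trapezoid rule
## applied to p(x) * log( p(x) / q(x) )
kl_density_est  <-  function(x, y, n.grid = 1024) {
    lo  <-  min(x, y) ;  hi  <-  max(x, y)
    px  <-  density(x, from = lo, to = hi, n = n.grid)
    qx  <-  density(y, from = lo, to = hi, n = n.grid)
    # keep only grid points where both estimated densities are clearly positive
    ok  <-  (px$y > 1e-12) & (qx$y > 1e-12)
    f   <-  px$y[ok] * log( px$y[ok] / qx$y[ok] )
    g   <-  px$x[ok]
    sum( diff(g) * (head(f, -1) + tail(f, -1)) / 2 )
}

Anything along these lines inherits the bandwidth choice and the truncation of the grid, so it is more of a baseline than a serious estimator.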

But first your second question; then I will return to the first one. Let's say $p$ and $q$ are uniform densities on $[0,1]$ and $[0,10]$ respectively. Then $\KL(p || q)=\log 10$, while $\KL(q || p)$ is more difficult to define, but the only reasonable value to give it is $\infty$, as far as I can see, since it involves integrating $\log(1/0)$, which we can choose to interpret as $\log \infty$. These results are reasonable from the interpretation I give in Intuition on the Kullback-Leibler (KL) Divergence.
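Written out, the first calculation is $$ \KL(p || q) = \int_0^1 1 \cdot \log\frac{1}{1/10} \; dx = \log 10, $$ while $\KL(q || p) = \int_0^{10} \frac{1}{10} \log\frac{1/10}{p(x)} \; dx$ runs into $p(x)=0$ on $(1,10]$, which is exactly the $\log(1/0)$ issue.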

Returning to the main question. It is asked in a very nonparametric way, and no assumptions are stated on the densities. Probably some assumptions are needed. But assuming the two densities are proposed as competing models for the same phenomenon, we can probably assume they have the same dominating measure: the KL divergence between a continuous and a discrete probability distribution would always be infinity, for example. One paper addressing this question is the following: https://pdfs.semanticscholar.org/1fbd/31b690e078ce938f73f14462fceadc2748bf.pdf It proposes a method which does not need preliminary density estimation, and analyses its properties.

(There are many other papers.) Some details from that paper, the main ideas, follow in the EDIT below.

EDIT

Some ideas from that paper, which is about estimating the KL divergence from iid samples from absolutely continuous distributions. I show their proposal for one-dimensional distributions, but they also give a solution for vectors (using nearest-neighbour density estimation). For proofs, read the paper!

They propose to use a version of the empirical distribution function, but interpolated linearly between sample points to get a continuous version. They define $$ P_e(x) = \frac1{n}\sum_{i=1}^n U(x-x_i) $$ where $U$ is the Heaviside step function, but defined so that $U(0)=0.5$. That function, interpolated linearly (and extended horizontally beyond the range), is $P_c$ ($c$ for continuous). Then they propose to estimate the Kullback-Leibler divergence by $$ \hat{D}(P \| Q) = \frac1{n}\sum_{i=1}^n \log\left(\frac{\delta P_c(x_i)}{\delta Q_c(x_i)}\right) $$ where $\delta P_c(x_i) = P_c(x_i)-P_c(x_i-\epsilon)$, $\delta Q_c(x_i)$ is defined analogously, and $\epsilon$ is a number smaller than the smallest spacing of the samples.

R code for the version of the empirical distribution function that we need is

my.ecdf  <-  function(x)   {
    x    <-  sort(x)
    x.u  <-  unique(x)
    n    <-  length(x)
    # rle() gives the multiplicity of each unique value, so ties are handled
    x.rle  <-  rle(x)$lengths
    # height (cumulative count - 0.5)/n at each unique point,
    # i.e. (i - 0.5)/n when all sample points are distinct
    y  <-  (cumsum(x.rle) - 0.5) / n
    # linear interpolation between sample points; 0 to the left of the
    # sample range and 1 to the right of it
    FUN  <-  approxfun(x.u, y, method="linear", yleft=0, yright=1,
                       rule=2)
    FUN
}

Note that rle is used to take care of the case of duplicates in x.
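A quick check that this behaves like a distribution function:

x  <-  rnorm(1000)
F  <-  my.ecdf(x)
F(0)       # should be close to pnorm(0) = 0.5
F(1.96)    # should be close to pnorm(1.96), about 0.975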

Then the estimation of the KL divergence is given by

KL_est  <-  function(x, y)   {
    # spacings between the distinct points of each sample
    dx  <-  diff(sort(unique(x)))
    dy  <-  diff(sort(unique(y)))
    ex  <-  min(dx) ;  ey  <-  min(dy)
    # epsilon: half the smallest spacing found in either sample
    e   <-  min(ex, ey) / 2
    n   <-  length(x)
    P   <-  my.ecdf(x) ;  Q  <-  my.ecdf(y)
    # average of log( delta P_c(x_i) / delta Q_c(x_i) ) over the x sample
    KL  <-  sum( log( (P(x) - P(x-e)) / (Q(x) - Q(x-e)) ) ) / n
    KL
}
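As a sanity check with the two normals from the question (reading $\mathcal N(1,2)$ as mean 1 and standard deviation 2), the closed-form divergence is $\log 2 + \frac{1+1}{2\cdot 2^2} - \frac12 \approx 0.443$, and the estimate should land somewhere in that neighbourhood for largish samples:

set.seed(1)                           # just for reproducibility
x  <-  rnorm(1000)                    # sample from N(0,1)
y  <-  rnorm(1000, mean=1, sd=2)      # sample from N(1,2), sd = 2
KL_est(x, y)                          # compare with the closed-form 0.443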

Then I show a small simulation:

KL  <-  replicate(1000, {x  <-  rnorm(100)
                         y <- rt(100, df=5)
                         KL_est(x, y)})
hist(KL, prob=TRUE)

which gives the following histogram, showing an estimate of the sampling distribution of this estimator:

[Histogram: sampling distribution of the KL divergence estimator]
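Numerical summaries of the simulated sampling distribution can be obtained with, for example,

mean(KL) ;  sd(KL)
summary(KL)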

For comparison, we calculate the KL divergence in this example by numerical integration:

LR  <-  function(x) dnorm(x,log=TRUE)-dt(x,5,log=TRUE)   # log-likelihood ratio of N(0,1) vs t_5
100*integrate(function(x) dnorm(x)*LR(x),lower=-Inf,upper=Inf)$value
[1] 3.337668

Hmm ... the difference is large enough that there is much here to investigate!