[Math] Kullback-Leibler divergence based kernel

information theory, machine learning, numerical methods, probability theory

I'm looking at this paper: "A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications". The authors suggest using the following kernel function for two distributions $p$ and $q$: $k(p,q)= \exp (-a (D_{KL}(p,q) + D_{KL}(q,p)))$, where $a>0$ and $D_{KL}$ is the Kullback-Leibler divergence between $p$ and $q$. But it is not obvious from the paper that this kernel is positive definite.
How can it be proved that this kernel is positive definite?

Also, it's well known that $\exp (-a (D_{KL}(p,q) + D_{KL}(q,p)))$ is positive definite for all $a>0$ if and only if $D_{KL}(p,q) + D_{KL}(q,p)$ is a negative definite kernel. How can one prove that fact?
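
For concreteness (this is just a numerical probe, not a proof, and the helper names `kl` and `sym_kl_kernel` below are my own), one can build a Gram matrix of this kernel on random distributions and inspect its smallest eigenvalue:

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) with natural logarithms
    return np.sum(p * np.log(p / q))

def sym_kl_kernel(p, q, a=1.0):
    # k(p, q) = exp(-a (D_KL(p||q) + D_KL(q||p))), the kernel from the paper
    return np.exp(-a * (kl(p, q) + kl(q, p)))

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5), size=20)   # 20 random distributions over 5 outcomes
K = np.array([[sym_kl_kernel(p, q) for q in P] for p in P])

# A strictly negative value here would certify that this particular Gram
# matrix is not positive semidefinite (a non-negative value proves nothing).
print(np.linalg.eigvalsh(K).min())
```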

Best Answer

If you have a kernel of the form $K(x,y) = \exp(-a\,M(x,y))$, all that is needed is for $M(x,y)$ to be a valid metric. So all that is required is to prove that the symmetrised K-L divergence (call it $KLS(p,q)$) is a valid metric.

For all $x, y, z \in X$, a metric $d$ is required to satisfy the following conditions:

  1. $d(x, y) \geq 0$ (non-negativity)
  2. $d(x, y) = 0 \iff x = y$ (identity of indiscernibles; note that conditions 1 and 2 together produce positive definiteness)
  3. $d(x, y) = d(y, x)$ (symmetry)
  4. $d(x, z) \leq d(x, y) + d(y, z)$ (subadditivity / triangle inequality).

Conditions 1 and 2 hold for each of $KL(p,q)$ and $KL(q,p)$, and therefore hold for $KLS(p,q)$; condition 3 holds trivially.

However, condition 4 does not hold:

Counterexample: consider $a = [0.3,\ 0.3,\ 0.4]$, $b = [0.25,\ 0.35,\ 0.4]$, $c = [0.16,\ 0.33,\ 0.51]$.

We have

$KL(a||b)+KL(b||a)+KL(b||c)+KL(c||b)-[KL(a||c)+KL(c||a)]\approx -0.0327<0$

So $KLS(p,q)$ is not a valid metric.
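
In case it helps, the violation is easy to reproduce numerically; the sketch below assumes the KL divergences are computed in nats (natural logarithms), which matches the $\approx -0.0327$ figure above:

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) in nats
    return np.sum(p * np.log(p / q))

def kls(p, q):
    # symmetrised KL divergence: KL(p||q) + KL(q||p)
    return kl(p, q) + kl(q, p)

a = np.array([0.30, 0.30, 0.40])
b = np.array([0.25, 0.35, 0.40])
c = np.array([0.16, 0.33, 0.51])

# The triangle inequality would require KLS(a,c) <= KLS(a,b) + KLS(b,c);
# a negative value here shows it is violated.
print(kls(a, b) + kls(b, c) - kls(a, c))   # about -0.0327
```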

Unless I've missed something, I do not believe that their kernels are necessarily positive definite. I assume this wasn't raised during the review process, otherwise I'd expect to see it discussed in the paper. In practice it may not be a problem: for their real-world examples the Gram matrices may have been (at least close to) symmetric positive semidefinite (SPSD), and with appropriate regularisation (even just adding a small constant to the diagonal) the algorithms should still work. There is also some work on solving SVMs with indefinite kernels, see e.g. "Training SVM with Indefinite Kernels" or "Analysis of SVM with Indefinite Kernels", so all is not lost even if the kernels are indefinite.
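
For illustration, the kind of quick fix I have in mind might look like the sketch below; this is the generic trick of adding jitter to the diagonal or clipping negative eigenvalues, not anything prescribed in the paper, and `eps` is an arbitrary choice:

```python
import numpy as np

def repair_gram(K, eps=1e-8, clip=False):
    """Ad-hoc repairs for a symmetric but possibly indefinite Gram matrix K."""
    if clip:
        # Project onto the PSD cone: zero out the negative eigenvalues.
        w, V = np.linalg.eigh(K)
        return (V * np.maximum(w, 0.0)) @ V.T
    # Otherwise just add a small constant ("jitter") to the diagonal.
    return K + eps * np.eye(K.shape[0])
```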

It's interesting that their results are so much better than those obtained with Fisher kernels (in my experience, Fisher kernels don't work that well either), so this is potentially a nice way of combining generative and discriminative methods. Let us know how you get on if you get round to using them!