If you have a kernel of the form $K(x,y) = \exp(-a\,M(x,y))$, all that is needed is for $M(x,y)$ to be a valid metric. So all that is required is to prove that the symmetrised K-L divergence, $KLS(p,q) = KL(p\|q) + KL(q\|p)$, is a valid metric.
For all $x, y, z \in X$, a metric $d$ is required to satisfy the following conditions:

1. $d(x, y) \geq 0$ (non-negativity)
2. $d(x, y) = 0 \iff x = y$ (identity of indiscernibles; note that conditions 1 and 2 together give positive definiteness)
3. $d(x, y) = d(y, x)$ (symmetry)
4. $d(x, z) \leq d(x, y) + d(y, z)$ (subadditivity / triangle inequality).
Conditions 1 and 2 hold for each of $KL(p\|q)$ and $KL(q\|p)$, and therefore hold for $KLS(p,q)$.
Condition 3 holds by construction, since $KLS(p,q)$ sums the divergence in both directions.
However, condition 4 (the triangle inequality) does not hold:
Counterexample: consider
$$a = [0.3,\ 0.3,\ 0.4], \quad b = [0.25,\ 0.35,\ 0.4], \quad c = [0.16,\ 0.33,\ 0.51].$$
Then we have
$$KL(a\|b) + KL(b\|a) + KL(b\|c) + KL(c\|b) - \left[ KL(a\|c) + KL(c\|a) \right] \approx -0.0327 < 0,$$
i.e. $KLS(a,b) + KLS(b,c) < KLS(a,c)$, violating the triangle inequality.
So $KLS(p,q)$ is not a valid metric.
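If you want to check the arithmetic yourself, here is a minimal numerical sketch (plain numpy; `kl` and `kls` are ad-hoc helper names of mine, not anything from the paper):

```python
import numpy as np

def kl(p, q):
    """KL divergence KL(p || q) in nats, for strictly positive vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def kls(p, q):
    """Symmetrised KL divergence: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)

a = [0.3, 0.3, 0.4]
b = [0.25, 0.35, 0.4]
c = [0.16, 0.33, 0.51]

# The triangle inequality would require KLS(a,c) <= KLS(a,b) + KLS(b,c).
slack = kls(a, b) + kls(b, c) - kls(a, c)
print(slack)  # ~ -0.0327, negative, so the triangle inequality fails
```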
Unless I've missed something, I do not believe that their kernels are necessarily positive definite. I'm assuming this wasn't raised in the review process, otherwise I'd expect to see it discussed in the paper. Practically, it may not be a problem: for their real-world examples the kernel matrices may have been (at least close to) symmetric positive semi-definite (SPSD), and with appropriate regularisation (even just adding a small constant to the diagonal, as sketched below) the algorithms should still work. There is also some work on solving SVMs with indefinite kernels, see e.g. *Training SVM with Indefinite Kernels* or *Analysis of SVM with Indefinite Kernels*, so all is not lost even if the kernels are indefinite.
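For concreteness, the diagonal-shift regularisation mentioned above can be as simple as the following sketch (my own illustration, not anything from the paper; `regularise_kernel` is a hypothetical helper):

```python
import numpy as np

def regularise_kernel(K, eps=1e-8):
    """Shift the diagonal of a symmetric kernel matrix just enough to
    make its smallest eigenvalue at least eps, so downstream solvers
    that assume an SPSD matrix behave sensibly."""
    K = np.asarray(K, dtype=float)
    min_eig = np.linalg.eigvalsh(K).min()  # K assumed symmetric
    if min_eig < eps:
        K = K + (eps - min_eig) * np.eye(K.shape[0])
    return K
```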
It's interesting that their results are so much better than those obtained with Fisher kernels (in my experience too, Fisher kernels don't work that well), so this is potentially a nice way of combining generative and discriminative methods. Let us know how you get on if you get round to using them!
[Too long for a comment, but this should restore your faith in KL divergence.]
Here's why I like KL-divergence. Let's say you have two probability measures $\mu$ and $\nu$ on some finite set $X$. Someone secretly chooses either $\mu$ or $\nu$. You receive a certain number $T$ of elements of $X$ chosen randomly and independently according to the secret measure. You want to guess the secret measure correctly with high probability. What do you do?
The best "algorithm" to follow would be to observe the $T$ samples $x_1,\dots,x_T$ and choose $\mu$ or $\nu$ based on which one is more likely to generate these $T$ samples. The probability that $\mu$ generates these samples is $\prod_{j=1}^T \mu(x_j)$, and the probability that $\nu$ generates them is $\prod_{j=1}^T \nu(x_j)$. So we choose $\mu$ iff $\prod_{j=1}^T \frac{\mu(x_j)}{\nu(x_j)} > 1$, which is the same as $\sum_{j=1}^T \log \frac{\mu(x_j)}{\nu(x_j)} > 0$. If we let $Z : X \to [-\infty,\infty]$ be the random variable defined by $Z(x) = \log \frac{\mu(x)}{\nu(x)}$, then the expected value of $Z$ under $\mu$ is exactly the KL-divergence between $\mu$ and $\nu$. And then of course the sum of $T$ independent copies of $Z$ has expectation $T \cdot KL(\mu,\nu)$.
As $T$ increases, by the weak law of large numbers, the average $\frac{1}{T}\sum_{t \le T} Z_t$ converges in probability to $KL(\mu,\nu)$. The fact that $KL(\mu,\nu) > 0$ (if $\mu \not = \nu$) corresponds to the fact that our algorithm will succeed with probability tending to $1$ as $T \to \infty$.
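To make this concrete, here is a small simulation of the guessing game under the assumption that the secret measure is $\mu$ (the two distributions and the sample sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two measures on the finite set X = {0, 1, 2} (illustrative values).
mu = np.array([0.3, 0.3, 0.4])
nu = np.array([0.25, 0.35, 0.4])

def guess_is_mu(samples):
    """Likelihood-ratio rule: guess mu iff sum_j log(mu(x_j)/nu(x_j)) > 0."""
    return np.sum(np.log(mu[samples] / nu[samples])) > 0

def success_rate(T, trials=2_000):
    """Empirical probability of correctly identifying mu from T samples."""
    wins = sum(guess_is_mu(rng.choice(3, size=T, p=mu)) for _ in range(trials))
    return wins / trials

for T in (10, 100, 1000, 10_000):
    print(T, success_rate(T))
# The success probability creeps towards 1 as T grows; the drift of the
# log-likelihood-ratio sum is T * KL(mu, nu), here about 0.0085 nats per sample.
```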
The point is that KL-divergence is an important quantity when trying to distinguish between different distributions, based on observed samples.
Best Answer
Actually, no.