If you have a kernel of the form $K(x,y) = \exp(-a\,M(x,y))$, all that is needed is for $M(x,y)$ to be a valid metric. So the question reduces to proving that the symmetrised K-L divergence (call it $KLS(p,q)$) is a valid metric.
For all $x, y, z$ in $X$, a metric $d$ must satisfy the following conditions:
- $d(x, y) \geq 0$ (non-negativity)
- $d(x, y) = 0 \iff x = y$ (identity of indiscernibles; note that conditions 1 and 2 together give positive definiteness)
- $d(x, y) = d(y, x)$ (symmetry)
- $d(x, z) \leq d(x, y) + d(y, z)$ (subadditivity / triangle inequality).
Conditions 1 and 2 hold for each of $KL(p,q)$ and $KL(q,p)$, and therefore hold for $KLS(p,q)$.
Condition 3 holds trivially, since $KLS$ is symmetric by construction.
However, condition 4 does not hold.
Counterexample: consider
$a = [0.3,\ 0.3,\ 0.4]$
$b = [0.25,\ 0.35,\ 0.4]$
$c = [0.16,\ 0.33,\ 0.51]$
we have
$$[KL(a\|b)+KL(b\|a)]+[KL(b\|c)+KL(c\|b)]-[KL(a\|c)+KL(c\|a)]\approx -0.0327<0,$$
i.e. $KLS(a,b)+KLS(b,c)<KLS(a,c)$.
So $KLS(p,q)$ is not a valid metric.
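A quick way to check the counterexample numerically (a minimal sketch in Python with numpy; natural logarithms, which reproduce the $\approx -0.0327$ figure above):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p||q) in nats (natural log)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def kls(p, q):
    """Symmetrised KL divergence: KL(p||q) + KL(q||p)."""
    return kl(p, q) + kl(q, p)

a = [0.3, 0.3, 0.4]
b = [0.25, 0.35, 0.4]
c = [0.16, 0.33, 0.51]

# The triangle inequality would require KLS(a,c) <= KLS(a,b) + KLS(b,c).
print(kls(a, b) + kls(b, c) - kls(a, c))  # ~ -0.0327 < 0, so it fails
```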
Unless I've missed something, I do not believe that their kernels are necessarily positive definite; I'm assuming it wasn't discussed in the review process, otherwise I'd expect to see it addressed in the paper. Practically it may not be a problem: for their real-world examples the kernel matrices may have been (at least close to) symmetric positive semi-definite (SPSD), and with appropriate regularisation (even just adding a small constant to the diagonal) the algorithms should still work. There is also some work on solving SVMs with indefinite kernels, see e.g. "Training SVM with Indefinite Kernels" or "Analysis of SVM with Indefinite Kernels", so all is not lost even if the kernels are indefinite.
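For the regularisation point, here is a minimal sketch of two simple fixes (diagonal loading, as mentioned above, plus spectrum clipping as an alternative); `K` is just an assumed indefinite symmetric kernel matrix, not anything specific from the paper:

```python
import numpy as np

def diagonal_loading(K, eps=1e-8):
    """Add a small constant to the diagonal, just enough to lift the
    smallest eigenvalue above zero."""
    K = (K + K.T) / 2.0                       # enforce symmetry
    lam_min = np.linalg.eigvalsh(K).min()
    return K + (max(0.0, -lam_min) + eps) * np.eye(K.shape[0])

def clip_spectrum(K):
    """Alternative: set negative eigenvalues to zero, which gives the
    nearest PSD matrix in Frobenius norm."""
    K = (K + K.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(K)
    return eigvecs @ np.diag(np.clip(eigvals, 0.0, None)) @ eigvecs.T
```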
It's interesting that their results are so much better than those obtained with Fisher kernels (in my experience, Fisher kernels don't work that well either), so this is potentially a nice way of combining generative and discriminative methods. Let us know how you get on if you get round to using them!
I have calculated the Kullback-Leibler divergence, which is equal to 0.492820258.
The KL divergence is not a dimensionless number; it has a unit (which depends on the base of the logarithm used), and you must specify it unless it's implied by the context. I guess you used base 2, hence the unit is bits.
The KL divergence (or "distance") is not symmetric: $D(p\,\|\,q) \ne D(q\,\|\,p)$, so, again, you must specify which one you computed.
In our case $D(p||q)=0.49282$ bits.
Regarding the numerical significance, you could first compute the entropies. In our case $H(p)=1.9486...$ and $H(q)=0.5745...$ (always in bits). In terms of source encoding (Shannon's first theorem), this says that source $p$ can be optimally encoded with $1.9486...$ bits per symbol. Now, if we encode source $p$ assuming (wrongly) that its true distribution were that of $q$, we'd get an average code length of $2.44142...$ bits per symbol (you can do the math). The "excess" cost, the inefficiency that arises from assuming a wrong distribution, is quantified by the KL divergence: $2.44142-1.9486=0.49282$ bits.
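A sketch of that bookkeeping (the actual $p$ and $q$ aren't given in the question, so the distributions below are placeholders; the point is the identity cross-entropy $=$ entropy $+$ divergence, all in bits with base-2 logs):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in bits (base-2 logs)."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log2(p)))

def cross_entropy(p, q):
    """Average code length, in bits, when encoding source p with a
    code optimised for q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log2(q)))

# Placeholder distributions (the thread does not give the actual p and q).
p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]

H_p = entropy(p)          # optimal code length for p
H_pq = cross_entropy(p, q)  # code length when p is encoded as if it were q
print(H_p, H_pq, H_pq - H_p)  # the difference is D(p||q) in bits
```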
The short answer is that KL divergence has a probabilistic/statistical meaning (a lot of them, in fact) while Euclidean distance does not. For example, a given difference $f(x)-g(x)$ has a whole different meaning depending on the absolute sizes of $f(x)$ and $g(x)$.
The WP page on the subject is a must read, naturally. Let me explain only one interpretation of KL divergence. Assume a random i.i.d. sample $\mathfrak X=(x_k)_{1\leqslant k\leqslant n}$ follows the distribution $f$ and a random i.i.d. sample $\mathfrak Y=(y_k)_{1\leqslant k\leqslant n}$ follows the distribution $g$. A way to distinguish $\mathfrak X$ from $\mathfrak Y$ is to ask for the likelihood that $\mathfrak Y$ behaves like $\mathfrak X$, that is, that $\mathfrak Y$ behaves like a typical sample from $f$.
More precisely, one wants to estimate how unlikely $\mathfrak Y$ becomes when one asks that $\mathfrak Y$ behaves like an $f$ sample, compared to its ordinary likelihood as a $g$ sample.
The computation is rather simple and based on the following. Assume $N(x,x+\mathrm dx)$ values from the sample fall in each interval $(x,x+\mathrm dx)$. Then the likelihood scales like
$$ \prod_x g(x)^{N(x,x+\mathrm dx)}=\exp\left(\sum_x N(x,x+\mathrm dx)\log g(x)\right). $$
For a typical $f$ sample, $N(x,x+\mathrm dx)\approx nf(x)\mathrm dx$ when $n\to\infty$, for every $x$, hence the likelihood of $\mathfrak Y$ masquerading as an $f$ sample scales like
$$ \ell_n(f\mid g)\approx\exp\left(n\int f(x)\log g(x)\,\mathrm dx\right). $$
On the other hand, for a typical $g$ sample, $N(x,x+\mathrm dx)\approx ng(x)\mathrm dx$ when $n\to\infty$, for every $x$, hence the likelihood of $\mathfrak Y$ behaving like a typical $g$ sample scales like
$$ \ell_n(g\mid g)\approx\exp\left(n\int g(x)\log g(x)\,\mathrm dx\right). $$
Thus $\ell_n(f\mid g)\ll\ell_n(g\mid g)$, as was to be expected, and the ratio $\dfrac{\ell_n(f\mid g)}{\ell_n(g\mid g)}$ decreases exponentially fast when $n\to\infty$, approximately like $\mathrm e^{-nH}$, where
$$ H=\int f(x)\log f(x)\,\mathrm dx-\int f(x)\log g(x)\,\mathrm dx=K(f\mid g). $$
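The rate $H$ is easy to see empirically. A minimal sketch (not part of the derivation above): since $H=\int f\log(f/g)$, for an i.i.d. sample from $f$ the average log-likelihood ratio $\frac1n\sum_k\log\frac{f(x_k)}{g(x_k)}$ converges to $K(f\mid g)$ by the law of large numbers. Two arbitrarily chosen Gaussians are used here so the divergence also has a closed form to compare against:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical choice: f = N(0, 1), g = N(1, 2), so KL(f||g) has a closed form.
f = norm(loc=0.0, scale=1.0)
g = norm(loc=1.0, scale=2.0)

# Monte Carlo estimate of H = E_f[log f(X) - log g(X)] = K(f|g).
x = f.rvs(size=1_000_000, random_state=rng)
H_mc = np.mean(f.logpdf(x) - g.logpdf(x))

# Closed form for Gaussians:
# KL(N(m1,s1) || N(m2,s2)) = log(s2/s1) + (s1^2 + (m1-m2)^2) / (2 s2^2) - 1/2
H_exact = np.log(2.0 / 1.0) + (1.0**2 + (0.0 - 1.0)**2) / (2 * 2.0**2) - 0.5
print(H_mc, H_exact)  # both approximately 0.443 nats
```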