I have calculated the Kullback–Leibler divergence, which is equal to 0.492820258.
The KL divergence is not a dimensionless number: it has a unit, which depends on the base of the logarithm used, so you must specify it unless it is implied by the context. I guess you used base 2, hence the unit is bits.
The KL divergence (or "distance") is not symmetric: $D(p||q) \ne D(q||p)$, so, again, you must specify which one you computed.
In our case $D(p||q)=0.49282$ bits.
Regarding the numerical significance, you could first compute the entropies. In our case $H(p)=1.9486...$ and $H(q)=0.5745...$ (always in bits). In terms of source encoding (Shannon's first theorem), this says that source $p$ can be optimally encoded with $1.9486...$ bits per symbol. Now, if we encode source $p$ assuming (wrongly) that its true distribution were that of $q$, we'd get an average code length of $2.44142...$ bits per symbol (you can do the math). The "excess" cost, the inefficiency that arises from assuming the wrong distribution, is exactly what the KL divergence quantifies: $2.44142-1.9486=0.49282$ bits.
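If it helps to see the bookkeeping, here is a minimal sketch in Python/NumPy. The distributions `p` and `q` below are made-up placeholders (the ones behind the numbers above are not reproduced here); the point is only that the cross-entropy minus the entropy gives the KL divergence, all in bits.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in bits (terms with p = 0 contribute nothing)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """Average code length, in bits/symbol, when source p is encoded with a code optimal for q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kl_divergence(p, q):
    """D(p||q) in bits: the coding overhead incurred by assuming q when the truth is p."""
    return cross_entropy(p, q) - entropy(p)

# Hypothetical 4-symbol distributions (placeholders, not the ones from the question).
p = np.array([0.40, 0.30, 0.20, 0.10])
q = np.array([0.80, 0.10, 0.05, 0.05])

print(f"H(p)    = {entropy(p):.4f} bits")           # optimal code length for p
print(f"H(p,q)  = {cross_entropy(p, q):.4f} bits")  # code length when q is assumed
print(f"D(p||q) = {kl_divergence(p, q):.4f} bits")  # the excess: H(p,q) - H(p)
```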
It just occurred to me that this follows immediately from the convexity of the KL divergence. It's hard to find information about it on the web, but page 6 here
http://camo.ici.ro/journal/vol11/v11d15.pdf
describes everything you need in terms of discrete relative entropy $D_\text{KL}$ (the very same concept).
In your case,
\begin{align*}
\text{KL}(h,f)
&= \text{KL}(w \cdot h + (1-w) \cdot h, \ w \cdot h + (1-w) \cdot g) \\
&\leq w \cdot \text{KL}(h,h) + (1-w) \cdot \text{KL}(h,g) \\
&= (1-w) \cdot \text{KL}(h,g)
\end{align*}
In summary,
$$\text{KL}(h,f) \leq(1-w) \cdot \text{KL}(h,g)$$
which is an even stronger inequality than what you asked for.
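Just as a sanity check of this inequality, here is a quick numerical sketch in Python/NumPy, with arbitrary random discrete distributions standing in for $h$ and $g$ (the continuous case works the same way):

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D(p||q) in nats (assumes q > 0 wherever p > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

rng = np.random.default_rng(0)
h = rng.dirichlet(np.ones(5))  # arbitrary distribution standing in for h
g = rng.dirichlet(np.ones(5))  # arbitrary distribution standing in for g

for w in np.linspace(0.0, 1.0, 11):
    f = w * h + (1 - w) * g                   # the mixture
    lhs, rhs = kl(h, f), (1 - w) * kl(h, g)
    assert lhs <= rhs + 1e-12                 # KL(h,f) <= (1-w) * KL(h,g)
    print(f"w = {w:.1f}:  KL(h,f) = {lhs:.4f}  <=  (1-w)*KL(h,g) = {rhs:.4f}")
```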
My previous answer (not a full answer, just my thoughts):
So you want to show
$$\text{KL}(h,f) \leq \text{KL}(h,g)$$
or rather
$$\int_\mathbb{R} h(x) \log \frac{h(x)}{w \ h(x) + (1-w)g(x)} dx
\leq \int_\mathbb{R} h(x) \log \frac{h(x)}{g(x)} dx$$
Rearranging, this is equivalent to
\begin{align*}
0 &\leq \int_\mathbb{R} h(x) \log \frac{w \ h(x) + (1-w)g(x)}{g(x)} dx \\
0 &\leq \int_\mathbb{R} h(x) \log\left(1-w + w\frac{h(x)}{g(x)}\right) dx
\end{align*}
Now this may not be easy to show directly, because the proof of non-negativity of the KL divergence (a.k.a. relative entropy) between distributions is already quite long and tricky. You could carefully read proofs of that theorem and try to adapt them to the problem at hand.
Alternatively, you could rewrite the above to
\begin{align*}
0 &\leq \int_\mathbb{R} h(x) \log \frac{h(x)}{a_w(x)} dx = \text{KL}(h,a_w)
\end{align*}
with
$$a_w(x) := \frac{1}{\frac{w}{g(x)} + \frac{1-w}{h(x)}} \ .$$
If you can show that $a_w(x)$ is a probability distribution $\forall w \in [0,1]$ (maybe it isn't, I don't know), then $0 \leq \text{KL}(h,a_w)$ and you're done.
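For what it's worth, a quick numerical sketch (Python/NumPy, with random discrete distributions standing in for $h$ and $g$) suggests $a_w$ generally does not normalize to 1 for $0 < w < 1$: pointwise, a weighted harmonic mean is at most the corresponding arithmetic mean, so the total mass of $a_w$ is at most 1, with a strict deficit unless $h = g$. This route would therefore need a renormalization or a different argument.

```python
import numpy as np

rng = np.random.default_rng(1)
h = rng.dirichlet(np.ones(5))  # arbitrary distribution standing in for h
g = rng.dirichlet(np.ones(5))  # arbitrary distribution standing in for g

for w in [0.0, 0.25, 0.5, 0.75, 1.0]:
    a_w = 1.0 / (w / g + (1 - w) / h)  # pointwise weighted harmonic mean of g and h
    print(f"w = {w:.2f}:  total mass of a_w = {a_w.sum():.4f}")  # < 1 for 0 < w < 1
```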
If you have a kernel of the form $K(x,y) = \exp\!\big(-a\,M(x,y)\big)$, all that is needed is for $M(x,y)$ to be a valid metric. So all that is required is to prove that the symmetrised KL divergence (call it $KLS(p,q)$) is a valid metric.
For all $x$, $y$, $z$ in $X$, a metric $d$ is required to satisfy the following conditions:
1. $d(x,y) \ge 0$ (non-negativity)
2. $d(x,y) = 0$ if and only if $x = y$ (identity of indiscernibles)
3. $d(x,y) = d(y,x)$ (symmetry)
4. $d(x,z) \le d(x,y) + d(y,z)$ (triangle inequality)
1 and 2 hold for each of $KL(p,q)$ and $KL(q,p)$ and therefore hold for $KLS(p,q)$. 3 holds trivially.
However, 4 does not hold.
Counterexample: consider $a=[0.3,\ 0.3,\ 0.4]$, $b=[0.25,\ 0.35,\ 0.4]$, $c=[0.16,\ 0.33,\ 0.51]$;
we have
$KL(a||b)+KL(b||a)+KL(b||c)+KL(c||b)-[KL(a||c)+KL(c||a)]\approx -0.0327<0$
So $KLS(p,q)$ is not a valid metric.
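For anyone who wants to reproduce the counterexample, here is a short Python/NumPy check; with natural logarithms it matches the $\approx -0.0327$ quoted above:

```python
import numpy as np

def kl(p, q):
    """D(p||q) in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

def kls(p, q):
    """Symmetrised KL divergence: KL(p||q) + KL(q||p)."""
    return kl(p, q) + kl(q, p)

a = [0.30, 0.30, 0.40]
b = [0.25, 0.35, 0.40]
c = [0.16, 0.33, 0.51]

# A metric would require KLS(a,b) + KLS(b,c) >= KLS(a,c); here the difference is negative.
print(kls(a, b) + kls(b, c) - kls(a, c))  # approx -0.0327
```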
Unless I've missed something, I do not believe that their kernels are necessarily positive definite; I'm assuming this wasn't discussed in the review process, otherwise I'd expect to see it addressed in the paper. Practically, it may not be a problem: for their real-world examples the kernel matrices may have been (at least close to) SPSD, and with appropriate regularisation (even just adding a small constant to the diagonal) the algorithms should still work. There is also some work on solving SVMs with indefinite kernels, see e.g. "Training SVM with Indefinite Kernels" or "Analysis of SVM with Indefinite Kernels", so all is not lost even if the kernels are indefinite.
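To make the "small constant on the diagonal" remark concrete, here is a rough sketch in Python/NumPy, with made-up distributions and an arbitrary scale parameter `a` (not the authors' actual setup): it builds a KLS-based kernel matrix, checks its smallest eigenvalue, and applies a ridge if the matrix turns out indefinite.

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

def kls(p, q):
    return kl(p, q) + kl(q, p)

# Made-up "data set": each row is a discrete distribution over 4 symbols.
rng = np.random.default_rng(2)
X = rng.dirichlet(np.ones(4), size=20)

a = 1.0  # hypothetical kernel scale parameter
K = np.array([[np.exp(-a * kls(p, q)) for q in X] for p in X])

min_eig = np.linalg.eigvalsh(K).min()
print("smallest eigenvalue of K:", min_eig)

# If K is indefinite, a small ridge on the diagonal restores positive semi-definiteness.
if min_eig < 0:
    K += (1e-8 - min_eig) * np.eye(len(K))
```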
It's interesting that their results are so much better than using Fisher kernels - in my experience too, Fisher kernels don't work that well - so this is potentially a nice way of combining generative and discriminative methods. Let us know how you get on if you get round to using them!!