The KL divergence is a difference of integrals of the form
$$\begin{aligned}
I(a,b,c,d)&=\int_0^{\infty} \log\left(\frac{e^{-x/a}x^{b-1}}{a^b\Gamma(b)}\right) \frac{e^{-x/c}x^{d-1}}{c^d \Gamma(d)}\, \mathrm dx \\
&=-\frac{1}{a}\int_0^\infty \frac{x^d e^{-x/c}}{c^d\Gamma(d)}\, \mathrm dx
- \log(a^b\Gamma(b))\int_0^\infty \frac{e^{-x/c}x^{d-1}}{c^d\Gamma(d)}\, \mathrm dx\\
&\quad+ (b-1)\int_0^\infty \log(x) \frac{e^{-x/c}x^{d-1}}{c^d\Gamma(d)}\, \mathrm dx\\
&=-\frac{cd}{a}
- \log(a^b\Gamma(b))
+ (b-1)\int_0^\infty \log(x) \frac{e^{-x/c}x^{d-1}}{c^d\Gamma(d)}\,\mathrm dx
\end{aligned}$$
We just have to deal with the remaining integral on the right, which can be evaluated by observing
$$\begin{aligned}
\frac{\partial}{\partial d}\Gamma(d) &= \frac{\partial}{\partial d}\int_0^{\infty}e^{-x/c}\frac{x^{d-1}}{c^d}\, \mathrm dx\\
&= \frac{\partial}{\partial d} \int_0^\infty e^{-x/c} \frac{(x/c)^{d-1}}{c}\, \mathrm dx\\
&=\int_0^\infty e^{-x/c}\frac{x^{d-1}}{c^d} \log\frac{x}{c} \, \mathrm dx\\
&=\int_0^{\infty}\log(x)e^{-x/c}\frac{x^{d-1}}{c^d}\, \mathrm dx - \log(c)\Gamma(d).
\end{aligned}$$
Whence
$$\frac{b-1}{\Gamma(d)}\int_0^{\infty} \log(x)\,e^{-x/c}\frac{x^{d-1}}{c^d}\, \mathrm dx = (b-1)\frac{\Gamma'(d)}{\Gamma(d)} + (b-1)\log(c).$$
Plugging into the preceding yields
$$I(a,b,c,d)=\frac{-cd}{a} -\log(a^b\Gamma(b))+(b-1)\frac{\Gamma'(d)}{\Gamma(d)} + (b-1)\log(c).$$
The KL divergence between $\Gamma(c,d)$ and $\Gamma(a,b)$ equals $I(c,d,c,d) - I(a,b,c,d)$, which is straightforward to assemble.
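Carrying out that assembly (just subtracting the two evaluations of $I$ and collecting terms, nothing further is assumed) gives, for reference,
$$D_{\mathrm{KL}}\bigl(\Gamma(c,d)\,\|\,\Gamma(a,b)\bigr)=I(c,d,c,d)-I(a,b,c,d)=(d-b)\,\psi(d)-\log\Gamma(d)+\log\Gamma(b)+b\log\frac{a}{c}+\frac{d\,(c-a)}{a}.$$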
Implementation Details
Gamma functions grow rapidly, so to avoid overflow don't compute $\Gamma$ and then take its logarithm: instead, use the log-Gamma function found in any statistical computing platform (including Excel, for that matter).
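To see why this matters, here is a quick R illustration (a minimal sketch; the shape value 200 is arbitrary and chosen only to trigger overflow):
# gamma() overflows double precision long before lgamma() has any trouble:
log(gamma(200))   # Inf: Gamma(200) overflows double precision
lgamma(200)       # about 857.93, computed without ever forming Gamma(200)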
The ratio $\Gamma^\prime(d)/\Gamma(d)$ is the logarithmic derivative of $\Gamma,$ generally called $\psi,$ the digamma function. If it's not available to you, there are relatively simple ways to approximate it, as described in the Wikipedia article.
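As a rough sketch of one such approximation (a fallback only, since R already provides psigamma/digamma; it assumes a single positive argument): shift the argument upward with the recurrence $\psi(x) = \psi(x+1) - 1/x$, then apply the usual asymptotic series.
#
# Fallback approximation to the digamma function for platforms
# lacking a built-in one. Assumes a single positive argument.
#
digamma.approx <- function(x) {
  s <- 0
  while (x < 6) {      # recurrence: psi(x) = psi(x+1) - 1/x
    s <- s - 1/x
    x <- x + 1
  }
  # asymptotic series, accurate once x is moderately large
  s + log(x) - 1/(2*x) - 1/(12*x^2) + 1/(120*x^4) - 1/(252*x^6)
}
digamma.approx(195) - psigamma(195)   # difference should be negligibly small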
Here, to illustrate, is a direct R implementation of the formula in terms of $I$. This does not exploit an opportunity to simplify the result algebraically, which would make it a little more efficient (by eliminating a redundant calculation of $\psi$).
#
# `b` and `d` are Gamma shape parameters and
# `a` and `c` are scale parameters.
# (All, therefore, must be positive.)
#
KL.gamma <- function(a, b, c, d) {
  # i(a,b,c,d) evaluates the integral I(a,b,c,d) derived above.
  i <- function(a, b, c, d)
    -c * d / a - b * log(a) - lgamma(b) + (b - 1) * (psigamma(d) + log(c))
  # KL divergence between Gamma(scale=c, shape=d) and Gamma(scale=a, shape=b)
  i(c, d, c, d) - i(a, b, c, d)
}
print(KL.gamma(1/114186.3, 202, 1/119237.3, 195), digits=12)
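As an optional sanity check (not part of the solution above), one can compare the closed form with direct numerical integration of the KL integrand; the parameter values below are arbitrary, chosen small enough that integrate() behaves well.
#
# Numerical check of KL.gamma against direct integration.
# Arbitrary test parameters: a, c are scales; b, d are shapes.
#
a <- 2; b <- 3; c <- 1.5; d <- 4
f <- function(x) dgamma(x, shape=d, scale=c) *
  (dgamma(x, shape=d, scale=c, log=TRUE) - dgamma(x, shape=b, scale=a, log=TRUE))
integrate(f, 0, Inf)$value   # numerical estimate of the divergence
KL.gamma(a, b, c, d)         # closed form; the two should agree closely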
As far as I can understand, you are solving the following problem:
there are two analytical distributions, $p(x)$ and $q(x)$, and you want to calculate the distance between them, $D(p, q)$.
There are plenty of measures of distance between two distributions:
I suggest you try a few from the list above, as all of them are rather easy to implement. In most applications, numerical experiments are what give you the key to success. Then you can select the one that suits you best (since I haven't found any requirements for this distance in your question, I can't suggest anything more specific).
As $p(x)$ and $q(x)$ are complex, it is almost impossible that an analytical expression exists for any such $D(p, q)$, so you will need a numerical way to calculate those distances. Note that calculating almost all of these distances involves numerical integration, so they will be rather imprecise if the dimension of $x$ is high.
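For instance, a one-dimensional distance of this kind takes only a few lines of numerical integration (a sketch only; the Hellinger distance is used purely as an example, and the two normal densities below are placeholders for your actual $p$ and $q$):
#
# Sketch: numerically computing one such distance (Hellinger, as an
# example) between two one-dimensional densities via integrate().
#
p <- function(x) dnorm(x, mean = 0, sd = 1)
q <- function(x) dnorm(x, mean = 1, sd = 2)
h2 <- 1 - integrate(function(x) sqrt(p(x) * q(x)), -Inf, Inf)$value
sqrt(h2)   # Hellinger distance, always between 0 and 1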
Best Answer
I'd like to add an answer, which may be unsatisfying, to this question through the lens of deep learning, mostly in NLP:
First things first, let's look at the definition (in terms of your question):
$$ KL(q||p)=\sum_s q(s)\log \frac{q(s)}{p(s)} $$ When $p(s) > 0$ and $q(s)\to 0$, the contribution of $s$ to the KL divergence shrinks to 0, which means MLE assigns an extremely low cost to scenarios where the model generates samples that do not lie on the data distribution.
Consider the case where the corpus at hand contains essentially all samples existing in the world; then $q(s) \to 0$ indicates that $s$ occurs very rarely in the corpus (by the law of large numbers), yet the model may still assign it a high probability (for example because of samples that look alike but are in fact different or even opposite in meaning). In this case, because such samples are barely trained on and yet receive high probability under the model distribution, rare samples that do not lie on the data distribution may be generated at test or validation time.
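A quick numerical illustration of that vanishing penalty (a sketch; the numbers are arbitrary):
# The per-sample term q * log(q / p) vanishes as q -> 0,
# even when the model probability p stays large.
q <- c(1e-2, 1e-4, 1e-6); p <- 0.5
q * log(q / p)   # roughly -3.9e-02, -8.5e-04, -1.3e-05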
For your sub-questions:
You can refer to this answer, which states that "Cross-entropy is preferred for classification, while mean squared error is one of the best choices for regression". Note that training by cross-entropy is the same as training using relative entropy. For the details please refer to this.
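The reason the two trainings coincide is the standard decomposition (here $q$ is the data distribution, fixed during training, and $p$ is the model):
$$H(q,p)=-\sum_s q(s)\log p(s)=\underbrace{-\sum_s q(s)\log q(s)}_{H(q),\ \text{constant in }p}+\sum_s q(s)\log\frac{q(s)}{p(s)}=H(q)+KL(q||p),$$
so minimizing the cross-entropy over the model parameters minimizes exactly $KL(q||p)$.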
If I understand your question correctly, I suppose it can be refuted by the counterexample of the loss function used for SVMs. Please refer to this question and this answer. Kullback-Leibler divergence cannot solve all problems in estimation.