The KL divergence is a difference of integrals of the form
$$\begin{aligned}
I(a,b,c,d)&=\int_0^{\infty} \log\left(\frac{e^{-x/a}x^{b-1}}{a^b\Gamma(b)}\right) \frac{e^{-x/c}x^{d-1}}{c^d \Gamma(d)}\, \mathrm dx \\
&=-\frac{1}{a}\int_0^\infty \frac{x^d e^{-x/c}}{c^d\Gamma(d)}\, \mathrm dx
- \log(a^b\Gamma(b))\int_0^\infty \frac{e^{-x/c}x^{d-1}}{c^d\Gamma(d)}\, \mathrm dx\\
&\quad+ (b-1)\int_0^\infty \log(x) \frac{e^{-x/c}x^{d-1}}{c^d\Gamma(d)}\, \mathrm dx\\
&=-\frac{cd}{a}
- \log(a^b\Gamma(b))
+ (b-1)\int_0^\infty \log(x) \frac{e^{-x/c}x^{d-1}}{c^d\Gamma(d)}\,\mathrm dx
\end{aligned}$$
We just have to deal with the right hand integral, which is obtained by observing
$$\eqalign{
\frac{\partial}{\partial d}\Gamma(d) =& \frac{\partial}{\partial d}\int_0^{\infty}e^{-x/c}\frac{x^{d-1}}{c^d}\, \mathrm dx\\
=& \frac{\partial}{\partial d} \int_0^\infty e^{-x/c} \frac{(x/c)^{d-1}}{c}\, \mathrm dx\\
=&\int_0^\infty e^{-x/c}\frac{x^{d-1}}{c^d} \log\frac{x}{c} \, \mathrm dx\\
=&\int_0^{\infty}\log(x)e^{-x/c}\frac{x^{d-1}}{c^d}\, \mathrm dx - \log(c)\Gamma(d).
}$$
Whence
$$\frac{b-1}{\Gamma(d)}\int_0^{\infty} \log(x)e^{-x/c}(x/c)^{d-1}\, \mathrm dx = (b-1)\frac{\Gamma'(d)}{\Gamma(d)} + (b-1)\log(c).$$
Plugging into the preceding yields
$$I(a,b,c,d)=\frac{-cd}{a} -\log(a^b\Gamma(b))+(b-1)\frac{\Gamma'(d)}{\Gamma(d)} + (b-1)\log(c).$$
The KL divergence between $\Gamma(c,d)$ and $\Gamma(a,b)$ equals $I(c,d,c,d) - I(a,b,c,d)$, which is straightforward to assemble.
Implementation Details
Gamma functions grow rapidly, so to avoid overflow don't compute Gamma and take its logarithm: instead use the log-Gamma function that will be found in any statistical computing platform (including Excel, for that matter).
The ratio $\Gamma^\prime(d)/\Gamma(d)$ is the logarithmic derivative of $\Gamma,$ generally called $\psi,$ the digamma function. If it's not available to you, there are relatively simple ways to approximate it, as described in the Wikipedia article.
Here, to illustrate, is a direct R
implementation of the formula in terms of $I$. This does not exploit an opportunity to simplify the result algebraically, which would make it a little more efficient (by eliminating a redundant calculation of $\psi$).
#
# `b` and `d` are Gamma shape parameters and
# `a` and `c` are scale parameters.
# (All, therefore, must be positive.)
#
KL.gamma <- function(a,b,c,d) {
i <- function(a,b,c,d)
- c * d / a - b * log(a) - lgamma(b) + (b-1)*(psigamma(d) + log(c))
i(c,d,c,d) - i(a,b,c,d)
}
print(KL.gamma(1/114186.3, 202, 1/119237.3, 195), digits=12)
Best Answer
Logistic regression is a form of binomial regression, so this will reduce to the KL divergence between two binomials. Since the probabilities depend on the covariate $x_i$, this will give a value depending on $i$, maybe you then are interested in the sum or in the average. I will not address that aspect, just look at the value for one $i$. I will use the notation and intuition from Intuition on the Kullback-Leibler (KL) Divergence It is natural to think about the intercept-only model as the null hypothesis, so that model will play the role of $Q$ in $$ \DeclareMathOperator{\KL}{KL} \KL(P || Q) = \int_{-\infty}^\infty p(x) \log \frac{p(x)}{q(x)} \; dx $$ where we for the binomial case will replace the integral with a sum over the two values $x=0,1$. Write $$ p=p_i = \frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0+\beta_1 x_i}},\quad q=q_i =\frac{e^{\gamma_0}}{1+e^{\gamma_0}} $$ where we use $q$ for the probability of the intercept-only model. I wrote the intercept differently in the two models because when estimating the two models on the same data we will not get the same intercept.
Then we only have to calculate the Kullback-Leibler divergence between the two binomial distributions, which is $$ KL(p || q) = (1-p)\log\frac{1-p}{1-q} + p \log\frac{p}{q} $$ As to your more general question "Is there any reference to the kullback leibler divergence between two GLM models? " GLM's are constructed from exponential families, like above logistic regression from binomial distributions, an exponential family. So like in this answer your question reduces to Kullback-Leibler divergence in exponential families. A reference to such results can be found in my answer at Quantify Difference/Distance between Lognormal distributions