Understanding Extra Terms $-p_i+q_i$ in SciPy’s Kullback-Leibler Divergence

Tags: convex, information theory, kullback-leibler, optimization

Why do some definitions of the Kullback-Leibler divergence include extra terms $-p_i + q_i$? For example, kl_div() (in the Python scipy.special module) defines the Kullback-Leibler divergence as
$$
\sum_i p_i \ln\frac{p_i}{q_i} - p_i + q_i.
$$
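For concreteness, here is a minimal numerical check (assuming NumPy and SciPy are installed) that kl_div() applies this formula element-wise; SciPy also provides rel_entr(), which, as far as I know, computes only the $p_i \ln(p_i/q_i)$ part:

```python
import numpy as np
from scipy.special import kl_div, rel_entr

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])

# kl_div is applied element-wise: p*log(p/q) - p + q
print(np.allclose(kl_div(p, q), p * np.log(p / q) - p + q))   # True

# rel_entr keeps only the p*log(p/q) part
print(np.allclose(rel_entr(p, q), p * np.log(p / q)))         # True
```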

The documentation says:

The origin of this function is in convex programming; see [1] for details. This is why the function contains the extra terms over what might be expected from the Kullback-Leibler divergence.

I don't have the referenced book at hand. What is the justification or motivation for the additional $-p_i + q_i$ terms?

Anti-closing note: This is not a question about software, but about the concept behind it.

Best Answer

The other answer tells us why we don't usually see the $-p_i+q_i$ terms: $p$ and $q$ usually live on the probability simplex, so each sums to one and the extra terms cancel: $\sum_i (-p_i + q_i) = -\sum_i p_i + \sum_i q_i = -1 + 1 = 0$.
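A quick numerical illustration of that cancellation (a sketch assuming NumPy and SciPy; the per-element sums are totalled with `.sum()`):

```python
import numpy as np
from scipy.special import kl_div, rel_entr

rng = np.random.default_rng(0)

# On the simplex: p and q each sum to 1, so the extra terms cancel
# and both definitions give the same total divergence.
p = rng.random(5)
p /= p.sum()
q = rng.random(5)
q /= q.sum()
print(np.isclose(kl_div(p, q).sum(), rel_entr(p, q).sum()))   # True

# Off the simplex: for general nonnegative vectors the totals differ
# by exactly sum(q) - sum(p).
p2, q2 = 2.0 * p, 3.0 * q
diff = kl_div(p2, q2).sum() - rel_entr(p2, q2).sum()
print(np.isclose(diff, q2.sum() - p2.sum()))                  # True
```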

In this answer, I want to show why those terms are there in the first place, by viewing the KL divergence as the Bregman divergence induced by the negative entropy function.

Given some differentiable function $\psi$, the Bregman divergence induced by it is a binary function on the domain of $\psi$:

$$ B_\psi(p,q) = \psi(p)-\psi(q)-\langle\nabla\psi(q),p-q\rangle $$

Intuitively, the Bregman divergence measures the gap between $\psi$ evaluated at $p$ and the linear approximation of $\psi$ about $q$, evaluated at $p$. When $\psi$ is convex it lies above all of its tangent planes, so this gap, and hence the Bregman divergence, is guaranteed to be nonnegative.
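A minimal sketch of this definition in code (the bregman_divergence helper and the psi/grad_psi callables are hypothetical names, not part of any library):

```python
import numpy as np

def bregman_divergence(psi, grad_psi, p, q):
    """B_psi(p, q) = psi(p) - psi(q) - <grad psi(q), p - q>."""
    return psi(p) - psi(q) - np.dot(grad_psi(q), p - q)

# Sanity check: psi(x) = ||x||^2 should induce the squared
# Euclidean distance ||p - q||^2.
psi = lambda x: np.dot(x, x)
grad_psi = lambda x: 2.0 * x

p = np.array([1.0, 2.0])
q = np.array([0.5, 1.5])
print(np.isclose(bregman_divergence(psi, grad_psi, p, q),
                 np.sum((p - q) ** 2)))   # True
```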

Noting that for $\psi(p) = \sum_i p_i \log p_i$ (the negative entropy) we have $\nabla\psi(p) = [\log p_i + 1]_i$, the entropic Bregman divergence is:

$$
\begin{aligned}
B_e(p,q) &= \sum_i p_i \log p_i - \sum_i q_i \log q_i - \sum_i [\log q_i + 1][p_i-q_i]\\
&= \sum_i p_i \log p_i - \sum_i q_i \log q_i - \sum_i [\log q_i \,(p_i-q_i) + p_i-q_i]\\
&= \sum_i p_i \log p_i - \sum_i q_i \log q_i - \sum_i p_i \log q_i + \sum_i q_i\log q_i - \sum_i[p_i-q_i]\\
&= \sum_i p_i \log p_i - \sum_i p_i \log q_i - \sum_i[p_i-q_i]\\
&= \sum_i p_i \log \frac{p_i}{q_i} + \sum_i[-p_i+q_i]
\end{aligned}
$$

which we recognize as the KL divergence you mentioned.
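As a sanity check on the derivation, the negative-entropy Bregman divergence can be compared numerically against scipy.special.kl_div (a sketch; the helper functions below are hypothetical, and unnormalized vectors are used so the extra terms do not cancel):

```python
import numpy as np
from scipy.special import kl_div

def neg_entropy(x):
    # psi(x) = sum_i x_i log x_i
    return np.sum(x * np.log(x))

def grad_neg_entropy(x):
    # gradient: [log x_i + 1]_i
    return np.log(x) + 1.0

def bregman(psi, grad_psi, p, q):
    return psi(p) - psi(q) - np.dot(grad_psi(q), p - q)

# Unnormalized positive vectors, so the -p_i + q_i terms do not cancel.
rng = np.random.default_rng(1)
p = rng.random(4) + 0.1
q = rng.random(4) + 0.1

print(np.isclose(bregman(neg_entropy, grad_neg_entropy, p, q),
                 kl_div(p, q).sum()))   # True
```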