Generalisations of the Kullback-Leibler divergence for more than two distributions

it.information-theory

A fundamental quantity in information theory is the Kullback-Leibler divergence between two probability distributions over the same random variable, $D_{KL}(Q\|P) = \sum_i q_i \log\frac{q_i}{p_i}$. It can be interpreted as the amount of information gained when updating from a prior $P$ to a posterior $Q$. Many, if not all, of the other important quantities in information theory can be seen as special cases of the KL divergence.
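
For concreteness, here is a minimal numerical sketch of this definition in Python (my own helper, not from any reference), using base-2 logarithms so the result is in bits and adopting the usual convention that terms with $q_i = 0$ contribute nothing:

```python
import numpy as np

def kl_divergence(q, p):
    """D_KL(Q||P) in bits: sum_i q_i * log2(q_i / p_i), with 0*log(0/p) taken as 0."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0                      # drop the 0*log(0/p) terms
    return float(np.sum(q[mask] * np.log2(q[mask] / p[mask])))

# Information gained when a uniform prior is updated to (3/4, 1/4):
print(kl_divergence([0.75, 0.25], [0.5, 0.5]))  # ~0.189 bits
```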

I'm interested in generalisations of the KL divergence to cases where we have three or more distributions rather than just two. In particular, I'm interested in fully understanding the case where one moves from a prior $P$ to a posterior $Q$, but then receives some additional information to arrive at a second posterior $R$. Is there literature on extending or generalising the KL divergence to deal with this three-distribution case?

Of course one can consider $D_{KL}(R\|Q)$ and $D_{KL}(R\|P)$ as well as $D_{KL}(Q\|P)$, and one can also consider something of the form $\sum_i r_i \log\frac{q_i}{p_i}$, which is really $D_{KL}(R\|P)-D_{KL}(R\|Q)$. This latter quantity is interesting because it can be negative as well as positive, and when it is negative it indicates that the information leading to $Q$ was misleading, or inconsistent with the information that led to $R$.
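
The identity follows by writing $\sum_i r_i \log\frac{q_i}{p_i} = \sum_i r_i \log\frac{r_i}{p_i} - \sum_i r_i \log\frac{r_i}{q_i}$. A quick numerical sanity check of it (the helper name and the random test distributions are just my own choices):

```python
import numpy as np

def kl(a, b):
    """D_KL(A||B) in bits, with 0*log(0/b) taken as 0."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    m = a > 0
    return float(np.sum(a[m] * np.log2(a[m] / b[m])))

rng = np.random.default_rng(0)
p, q, r = (rng.dirichlet(np.ones(5)) for _ in range(3))  # three random distributions

lhs = np.sum(r * np.log2(q / p))   # sum_i r_i log(q_i / p_i)
rhs = kl(r, p) - kl(r, q)          # D_KL(R||P) - D_KL(R||Q)
print(np.isclose(lhs, rhs))        # True
```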

However, these quantities seem not to tell the whole story by themselves. Intuitively, the information leading to $Q$ could be misleading in some ways but truthful in others, with $\sum_i r_i \log\frac{q_i}{p_i}$ giving only the net balance between the two rather than the full decomposition. So it seems to me that there should be two nonnegative quantities (both functions of the probabilities $p_i$, $q_i$ and $r_i$) that measure the amount of misleading information $M$ and the amount of consistent information $C$, with $\sum_i r_i \log\frac{q_i}{p_i} = C-M$. It is not obvious to me what form these functions should take.
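
One naive candidate (purely my own guess, not taken from any reference) would be to split the sum pointwise by the sign of $\log\frac{q_i}{p_i}$, taking $C$ to be the positive part and $M$ the negative part. The sketch below implements this split; the combined example in the appendix shows why it is not the decomposition I am after, since it returns $C = M = 0$ there rather than one bit of each.

```python
import numpy as np

def naive_split(r, q, p):
    """Sign-split of sum_i r_i log2(q_i/p_i) into (consistent C, misleading M).
    Purely an illustrative guess; terms with r_i = 0 are dropped."""
    r, q, p = (np.asarray(x, dtype=float) for x in (r, q, p))
    m = r > 0
    terms = r[m] * np.log2(q[m] / p[m])
    C = float(terms[terms > 0].sum())
    M = float(abs(terms[terms < 0].sum()))
    return C, M

# Combined Z example from the appendix: the only state with r > 0 has q = p,
# so the split gives (0, 0) rather than the intuitive (1 bit, 1 bit).
print(naive_split([0, 0, 0, 1], [0, 0.75, 0, 0.25], [0.25] * 4))  # (0.0, 0.0)
```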

The above intuition might not be clear without a motivating example, so I'll include one below. But really my question is just whether anything along the lines of the above has been written about before, or indeed, whether anything has been written about three-distribution analogues of the KL divergence at all.

Appendix: a motivating example

The above question is self-contained, but an example will make it clearer why I'm asking. I'll start with an example of misleading information and an example of truthful information, and then I'll combine them into a single example. In the combined system the quantity $\sum_i r_i \log\frac{q_i}{p_i} = 0$, but intuitively this is because there is one bit of misleading information and one bit of truthful information, which sum to zero.

To start with let's consider a variable $X$ that can take the value 0 or 1. Suppose we have
$$
p(X=0)=1/2,\qquad p(X=1)=1/2,\\
q(X=0)=3/4,\qquad q(X=1)=1/4,\\
r(X=0)=0,\qquad r(X=1)=1.
$$
On the first piece of information (taking us from $P$ to $Q$), we're led to believe that $X$ is more likely to be zero, but on the second piece of information (taking us from $Q$ to $R$) we find that $X=1$. So intuitively the first piece of information is misleading. Quantitatively, $\sum_i r_i \log\frac{q_i}{p_i} = \log_2\frac{1/4}{1/2} = -1\,\text{bit}$, suggesting that in some sense the information in $Q$ took us one bit "in the wrong direction," away from the "true" posterior $R$.
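
A quick numerical check of this value (dropping the $r_i = 0$ term, as above):

```python
import numpy as np

# X example: P uniform, Q leans towards X = 0, R reveals that X = 1.
p = np.array([0.5, 0.5])
q = np.array([0.75, 0.25])
r = np.array([0.0, 1.0])

m = r > 0                                   # only the X = 1 term survives
print(np.sum(r[m] * np.log2(q[m] / p[m])))  # -1.0 bit
```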

As a second example, consider another Boolean variable $Y$ with the probabilities
$$
p(Y=0)=1/2,\qquad p(Y=1)=1/2,\\
q(Y=0)=0,\qquad q(Y=1)=1,\\
r(Y=0)=0,\qquad r(Y=1)=1.
$$
Now intuition tells us that all the information about $Y$ was already in $Q$, with $R$ telling us nothing extra. We find that $\sum_i r_i \log\frac{q_i}{p_i} = 1\,\text{bit}$, indicating that $Q$ took us one bit towards the correct posterior.
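
The same check for $Y$:

```python
import numpy as np

# Y example: Q already concentrates on Y = 1, and R agrees.
p = np.array([0.5, 0.5])
q = np.array([0.0, 1.0])
r = np.array([0.0, 1.0])

m = r > 0
print(np.sum(r[m] * np.log2(q[m] / p[m])))  # +1.0 bit
```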

Let us now create a new random variable, $Z$, from the combined states of $X$ and $Y$, assuming that $X$ and $Y$ are independent under each of $P$, $Q$ and $R$. This gives us
$$
p(Z=00)=1/4,\qquad p(Z=01)=1/4,\qquad p(Z=10)=1/4,\qquad p(Z=11)=1/4,\\
q(Z=00)=0,\qquad q(Z=01)=3/4,\qquad q(Z=10)=0,\qquad q(Z=11)=1/4,\\
r(Z=00)=0,\qquad r(Z=01)=0,\qquad r(Z=10)=0,\qquad r(Z=11)=1.
$$
We can now calculate that for $Z$ we have $\sum_i r_i \log\frac{q_i}{p_i} = 0$. But intuitively, this is because we received one misleading bit about $X$ and one truthful bit about $Y$, with zero being the sum of these two independent pieces of information. The information in $Q$ correctly told us that $Z$ was neither equal to $00$ nor $10$, but at the same time it incorrectly indicated that $Z=01$ was more likely than $Z=11$, so it contained some truth and some falsehood simultaneously.
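
The same check for $Z$, together with the additivity over the two independent components that this intuition relies on (helper name my own):

```python
import numpy as np

def net_info(r, q, p):
    """sum_i r_i log2(q_i / p_i), dropping the terms with r_i = 0."""
    r, q, p = (np.asarray(x, dtype=float) for x in (r, q, p))
    m = r > 0
    return float(np.sum(r[m] * np.log2(q[m] / p[m])))

# Marginals for X and Y, and the product distributions for Z = (X, Y).
pX, qX, rX = [0.5, 0.5], [0.75, 0.25], [0.0, 1.0]
pY, qY, rY = [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]
pZ = np.outer(pX, pY).ravel()   # state order: 00, 01, 10, 11
qZ = np.outer(qX, qY).ravel()
rZ = np.outer(rX, rY).ravel()

print(net_info(rZ, qZ, pZ))                          # 0.0
print(net_info(rX, qX, pX) + net_info(rY, qY, pY))   # -1.0 + 1.0 = 0.0
```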

My question is about whether this last intuition can be captured formally via a function that looks only at the probabilities listed above for the values of $Z$. This function would return the value 1 bit for the probabilities above, indicating that $Q$ provides one bit of correct information about $Z$, and allowing us to deduce that it also contains one bit of misleading information.

I am interested both in whether such a function has explicitly been defined and studied, and in any related concepts that might help me to formulate it myself.

Best Answer

The Kullback-Leibler divergence $D_{\rm KL}(Q\|P)$ of two distributions $Q,P$ has been generalized to multiple distributions in various ways:

[1] information radius: $R(P_1,\ldots,P_k)=\frac{1}{k}\sum_{i=1}^k D_{\rm KL}(P_i\|\bar{P})$, where $\bar{P}=\frac{1}{k}\sum_{j=1}^k P_j$

[2] average divergence: $K(P_1,\ldots,P_k)=\frac{1}{k(k-1)}\sum_{i\neq j} D_{\rm KL}(P_i\|P_j)$

[3,4] dissimilarity: the weighted arithmetic mean of the KL divergences between each of the $P_i$’s and the barycenter of all the $P_i$’s
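
As a rough illustration of how [1] and [2] can be evaluated for a finite family of distributions (a sketch only; the function names are mine, results are in bits, and the example distributions are arbitrary):

```python
import numpy as np

def kl(a, b):
    """D_KL(A||B) in bits, with 0*log(0/b) taken as 0."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    m = a > 0
    return float(np.sum(a[m] * np.log2(a[m] / b[m])))

def information_radius(dists):
    """[1] Sibson: mean KL divergence to the equally weighted mixture."""
    dists = [np.asarray(d, dtype=float) for d in dists]
    centre = np.mean(dists, axis=0)
    return float(np.mean([kl(d, centre) for d in dists]))

def average_divergence(dists):
    """[2] Sgarro: average of D_KL(P_i||P_j) over all ordered pairs with i != j."""
    k = len(dists)
    total = sum(kl(a, b) for i, a in enumerate(dists)
                for j, b in enumerate(dists) if i != j)
    return total / (k * (k - 1))

dists = [[0.5, 0.5], [0.75, 0.25], [0.1, 0.9]]
print(information_radius(dists), average_divergence(dists))
```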

References

[1] Robin Sibson, Information radius. Probability Theory and Related Fields, 14, 149–160 (1969).

[2] Andrea Sgarro, Informational divergence and the dissimilarity of probability distributions. Calcolo, 18, 293–302 (1981).

[3] Michèle Basseville, Divergence measures for statistical data processing (2010).

[4] Darío García-García and Robert C. Williamson, Divergences and Risks for Multiclass Experiments (2012).