Is this Kullback-Leibler divergence between positive semidefinite matrices always well-defined?

convex-analysis, information theory, positive-semidefinite, quantum-information, symmetric matrices

In Section 1.3 of their paper Quantum Optimal Transport for Tensor Field Processing (arXiv link, published non-open-access version here), the authors Gabriel Peyré, Lénaïc Chizat, François-Xavier Vialard and Justin Solomon describe the following procedure to define a Kullback-Leibler divergence between symmetric positive semidefinite matrices:

Let $\mathcal S^d$ be the set of symmetric $d \times d$ matrices, $\mathcal S_+^d \subset \mathcal S^d$ the subset of positive semidefinite and $\mathcal S_{++}^d$ the subset of positive definite $d \times d$ matrices.
We want to extend the map
$$
\mathcal S_+^d \times \mathcal S_{++}^d \to \mathbb R^{d \times d}, \qquad
(P, Q) \mapsto P \log(Q)
$$

by lower semicontinuity to $\mathcal S_+^d \times \mathcal S_+^d$, where $\log \colon \mathcal S_{++}^d \to \mathcal S^d$ is the matrix logarithm.
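For concreteness, here is a small NumPy sketch (my own, not from the paper) of the matrix logarithm on $\mathcal S_{++}^d$ via an orthogonal eigendecomposition; the function name `matrix_log_pd` is just my label.

```python
import numpy as np

def matrix_log_pd(Q):
    """log(Q) = U diag(log(sigma_s)) U^T for symmetric positive definite Q."""
    sigma, U = np.linalg.eigh(Q)   # Q = U diag(sigma) U^T, all sigma > 0
    return U @ np.diag(np.log(sigma)) @ U.T

Q = np.array([[2.0, 1.0],
              [1.0, 2.0]])         # eigenvalues 1 and 3, so Q is positive definite
print(matrix_log_pd(Q))
```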
Note that $\mathcal S_{++}^d = \big\{ P \in \mathcal S_+^d: \ker(P) = \{ 0 \}\big\}$, so for $P, Q \in \mathcal S_+^d$ with orthogonal eigendecomposition $Q = U \text{diag}_s(\sigma_s) U^T$ it makes sense to define
\begin{gather*}
\mathcal S_+^d \times \mathcal S_+^d \to \mathbb R^{d \times d} \cup \{ \infty \}, \qquad
(P, Q) \mapsto P \log(Q)
:= \begin{cases}
P \log(Q), & \text{if } \ker(Q) = \{ 0 \}, \\
U \big[ \tilde{P} \text{diag}_s\big(\log(\sigma_s)\big) \big] U^T, & \text{if }\ker(Q) \ne \{ 0 \} \land \ker(Q) \subset \ker(P), \\
\infty, & \text{else,}
\end{cases}
\end{gather*}

where $\tilde{P} := U^T P U \in \mathcal S_+^d$ and we use the convention $0 \log(0) := 0$ when computing the matrix product in square brackets. (Indeed all products involving $\log(0)$ are of the form $0 \cdot \log(0)$.)
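To make sure I am reading the case distinction correctly, here is how I would implement the extended product in NumPy (my own sketch; the name `extended_P_logQ`, the use of `None` for the $\infty$ branch, and the numerical tolerance for the kernel comparison are my choices, not the paper's):

```python
import numpy as np

def extended_P_logQ(P, Q, tol=1e-12):
    """The extended product P log(Q) defined above; None stands for the infinity branch."""
    sigma, U = np.linalg.eigh(Q)             # Q = U diag(sigma) U^T
    P_tilde = U.T @ P @ U                    # the matrix called tilde{P} above
    zero = sigma <= tol                      # eigenvectors of Q spanning ker(Q)
    # ker(Q) is contained in ker(P) iff P kills every eigenvector of Q with
    # eigenvalue 0, i.e. iff the corresponding columns of tilde{P} vanish.
    if np.any(zero) and np.linalg.norm(P_tilde[:, zero]) > tol:
        return None                          # the "infinity" case
    log_sigma = np.zeros_like(sigma)
    log_sigma[~zero] = np.log(sigma[~zero])  # 0 * log(0) := 0 on ker(Q)
    return U @ (P_tilde @ np.diag(log_sigma)) @ U.T
```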

(I don't know what the subscript $_s$ indicates.)

Now, for $P, Q \in \mathcal S_+^d$ their Kullback-Leibler divergence is
$$
\text{KL}(P; Q)
:= \text{tr}(P \log(P) - P \log(Q) + P - Q) + \iota_{\mathcal S_{++}^d}(P),
$$

where $$\iota_A(x) = \begin{cases} 0, & \text{if } x \in A, \\ \infty, & \text{else} \end{cases}$$ is the convex indicator function.
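On the unproblematic part of the domain ($P, Q \in \mathcal S_{++}^d$) the formula is unambiguous; as a sanity check (mine, not from the paper), for diagonal positive definite matrices it reduces to the classical unnormalized KL divergence $\sum_i p_i \log(p_i/q_i) + p_i - q_i$:

```python
import numpy as np
from scipy.linalg import logm               # matrix logarithm, fine on S_++

def kl_pd(P, Q):
    """KL(P; Q) for positive definite P and Q (the unambiguous case)."""
    return np.trace(P @ logm(P) - P @ logm(Q) + P - Q)

p = np.array([0.5, 1.5, 2.0])
q = np.array([1.0, 1.0, 3.0])
print(kl_pd(np.diag(p), np.diag(q)))        # matrix formula ...
print(np.sum(p * np.log(p / q) + p - q))    # ... equals the classical one
```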

My question

How can this be well-defined, that is, how does it avoid indeterminate expressions of the form $\infty - \infty$?

Since the trace is linear (assuming this holds even if $P \log(Q) = \infty$, which is neither affirmed nor denied in the paper), we can focus only on the "trouble-maker" term $$- \text{tr}(P \log(Q)) + \iota_{\mathcal S_{++}^d}(P).$$
Assuming the authors set $\text{tr}(P \log(Q)) = \infty$ if $P \log(Q) = \infty$, consider $P := \text{diag}(1, 1, 0)$ and $Q := \text{diag}(1, 0, 0)$.
Then $$\{ 0 \} \ne \ker(Q) = \text{span}(e_2, e_3) \supsetneq \text{span}(e_3) = \ker(P),$$ where $e_k \in \mathbb R^3$ is the $k$-th basis vector.
Hence $P \log(Q) = \infty$, but as $P \not\in \mathcal S_{++}^3$, the trouble-maker term is $- \infty + \infty$, which is indeterminate.

Note that this counterexample cannot be constructed for $d = 2$, and if I am not mistaken, the issue I raise only arises for $d > 2$.
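For what it is worth, here is the counterexample checked numerically (my own sketch): the kernel condition fails, so $P \log(Q)$ falls into the $\infty$ branch, while $P$ itself is singular, so the indicator term is $\infty$ as well.

```python
import numpy as np

P = np.diag([1.0, 1.0, 0.0])
Q = np.diag([1.0, 0.0, 0.0])

sigma, U = np.linalg.eigh(Q)              # Q = U diag(sigma) U^T
P_tilde = U.T @ P @ U
zero = sigma <= 1e-12                     # eigenvectors spanning ker(Q)

print(np.linalg.norm(P_tilde[:, zero]))   # > 0, so ker(Q) is not contained in ker(P)
print(np.min(np.linalg.eigvalsh(P)))      # = 0, so P is not in S_++ and the indicator is infinite
```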

Best Answer

It appears to be a mistake in the paper. They probably meant to set $$\text{tr}(P \log(Q)) = -\infty \tag{1}$$ when $\ker(Q) \not\subset \ker(P)$. While they talk about extending the function $(P,Q) \mapsto P \log(Q)$, it appears in reality they only care about the function $(P,Q) \mapsto \text{tr}(P\log(Q))$, and this is where the definition makes sense.

Explanation. Let's explain why (1) with a minus sign makes sense intuitively. As $Q$ is symmetric, we can write $P$ and $Q$ in a basis where $Q$ is diagonal. To follow the notation of the paper, let $Q=U\widetilde{Q}U^T$, $P=U\widetilde{P} U^T$, where $U$ is an orthogonal matrix ($U^T = U^{-1}$) s.t. $\widetilde{Q}$ is diagonal, i.e. $\widetilde Q = \text{diag}(\sigma_1,\dots,\sigma_d)$ with $\sigma_1\geq \dots \geq \sigma_d$. Then $$\text{tr}(P\log(Q)) = \text{tr}(\widetilde P \log(\widetilde Q)) = \sum_{s=1}^{d} \widetilde{P}_{ss} \log(\sigma_s).\tag{2}$$ Note that both $\sigma_s$ and $\widetilde P_{ss}$ are non-negative. Hence, it makes sense to interpret $$\log(0) = \log(0+) = \lim_{\varepsilon\to 0+}\log(\varepsilon) = -\infty, \tag{3}$$ and to set the right-hand side of (2) to $-\infty$ when $\sigma_s = 0$ and $\widetilde P_{ss} > 0$ for some $s$. That condition is equivalent to $\ker(Q) \not\subset \ker(P)$.
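Here is a small NumPy sketch of this reading (the function name `tr_P_logQ` and the tolerance are mine): compute $\sum_s \widetilde P_{ss}\log(\sigma_s)$ and return $-\infty$ whenever some $\sigma_s = 0$ meets $\widetilde P_{ss} > 0$.

```python
import numpy as np

def tr_P_logQ(P, Q, tol=1e-12):
    """tr(P log Q) with convention (1): -inf when ker(Q) is not contained in ker(P)."""
    sigma, U = np.linalg.eigh(Q)            # Q = U diag(sigma) U^T
    p = np.diag(U.T @ P @ U)                # the diagonal entries tilde{P}_ss, all >= 0
    if np.any((sigma <= tol) & (p > tol)):
        return -np.inf                      # a positive weight hits log(0+) = -inf
    keep = sigma > tol                      # 0 * log(0) := 0 for the remaining terms
    return float(np.sum(p[keep] * np.log(sigma[keep])))

print(tr_P_logQ(np.diag([1.0, 1.0, 0.0]), np.diag([1.0, 0.0, 0.0])))  # -inf (the questioner's pair)
print(tr_P_logQ(np.diag([1.0, 0.0, 0.0]), np.diag([1.0, 1.0, 0.0])))  # 0.0, the kernel condition holds
```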

Upper semicontinuity. Alternatively, one can define the same extension by the formula $$\text{tr}(P\log(Q)) = \limsup_{P' \to P,\ Q' \to Q} \text{tr}(P' \log(Q')) = \limsup_{Q' \to Q} \text{tr}(P \log(Q')). \tag{4}$$ Before the extension is defined, the limit in (4) should be taken across positive definite $Q'$. However, one can check that (4) continues to be valid (after the extension is defined) when $Q'$ is allowed to be positive semidefinite. In other words, the extension is the largest possible extension satisfying upper semicontinuity.
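One can see this limsup behaviour numerically (again my own sketch, using the questioner's matrices): regularize $Q$ by $\varepsilon I$ and let $\varepsilon \to 0^+$; the trace diverges to $-\infty$ when the kernel condition fails and converges to a finite value when it holds.

```python
import numpy as np
from scipy.linalg import logm

Q = np.diag([1.0, 0.0, 0.0])
P_bad = np.diag([1.0, 1.0, 0.0])        # ker(Q) not contained in ker(P_bad)
P_ok = np.diag([1.0, 0.0, 0.0])         # ker(Q) = ker(P_ok)

for eps in [1e-2, 1e-4, 1e-8]:
    L = logm(Q + eps * np.eye(3))       # log of a positive definite regularization of Q
    print(eps, np.trace(P_bad @ L), np.trace(P_ok @ L))
# the middle column diverges to -inf, the last column tends to 0
```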
