Reference Request – Error and Stability Estimates for Information Projection

Tags: entropy, information-geometry, it.information-theory, pr.probability, reference-request

$\newcommand\SS{P}\newcommand\TT{Q}$I will call a Gaussian probability measure $\SS$ on $\mathbb{R}^d$ isotropic if its covariance matrix $\Sigma$ is diagonal with non-vanishing determinant; i.e., $\Sigma_{i,i}>0$ for $i=1,\dots,d$ and $\Sigma_{i,j}=0$ whenever $i\neq j$.
Note: My definition of "isotropic" includes "the usual isotropic Gaussian measures," which, from my limited understanding, are assumed to have a covariance of the form $\sigma I_d$ for some $\sigma>0$.

Let $\mathcal{P}$ be the set of isotropic Gaussian probability measures on $\mathbb{R}^d$, and let $\mathcal{Q}$ be the set of probability measures on $\mathbb{R}^d$ with a Lebesgue density, equipped with the TV distance.

Consider the information projection (or I-projection) defined by
\begin{align}
\pi:\mathcal{Q} &\rightarrow \mathcal{P}
\\
\pi(\TT) &:= \operatorname*{argmin}_{\SS\in \mathcal{P}}\, D(\SS\parallel\TT)
\end{align}
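
For concreteness (this is just my own rewriting of the objective, using only the standard formula for the differential entropy of a Gaussian): writing $\SS=N\big(\mu,\operatorname{diag}(\sigma_1^2,\dots,\sigma_d^2)\big)$ with Lebesgue density $p$, and letting $q$ denote the density of $\TT$, the quantity being minimized is
\begin{align}
D(\SS\parallel\TT)=\int_{\mathbb{R}^d} p\ln\frac{p}{q}
= -\frac12\sum_{i=1}^d \ln\!\big(2\pi e\,\sigma_i^2\big)-\int_{\mathbb{R}^d} p(x)\ln q(x)\,dx,
\end{align}
so the minimization is over the $2d$ parameters $(\mu,\sigma_1^2,\dots,\sigma_d^2)$.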

I'm looking for references on the following "elementary properties" of the I-projection:

  • Is the I-projection $\pi$ Lipschitz, at least locally?
  • Are there error bounds on $D(\pi(\TT)\parallel\TT)$ when $\TT$ is a Gaussian measure on $\mathbb{R}^d$ with non-singular covariance… (see the sketch just after this list)
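
For the second bullet, here is a small numerical sketch of what I have in mind (the closed form below is only my own calculation from the standard Gaussian–Gaussian KL formula, and the covariance, seed, and helper names in the code are illustrative choices, not from any reference): if $\TT=N(m,\Sigma)$ with $\Sigma$ non-singular, minimizing over diagonal-covariance $\SS=N(\mu,\operatorname{diag}(\sigma^2))$ seems to give $\mu=m$ and $\sigma_i^2=1/(\Sigma^{-1})_{ii}$, and I am looking for references that quantify the resulting $D(\pi(\TT)\parallel\TT)$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d = 3
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)   # a generic non-singular covariance (illustrative)
m = rng.standard_normal(d)
Sigma_inv = np.linalg.inv(Sigma)

def kl_gauss(mu, s2):
    """D(N(mu, diag(s2)) || N(m, Sigma)), via the standard Gaussian-Gaussian KL formula."""
    return 0.5 * (np.trace(Sigma_inv @ np.diag(s2))
                  + (m - mu) @ Sigma_inv @ (m - mu)
                  - d
                  + np.log(np.linalg.det(Sigma))
                  - np.sum(np.log(s2)))

# Direct numerical I-projection: optimize over (mu, log s2) so the variances stay positive.
res = minimize(lambda th: kl_gauss(th[:d], np.exp(th[d:])),
               np.zeros(2 * d), method="BFGS")
mu_num, s2_num = res.x[:d], np.exp(res.x[d:])

# Conjectured closed form: mu = m, sigma_i^2 = 1 / (Sigma^{-1})_{ii}.
mu_cf, s2_cf = m, 1.0 / np.diag(Sigma_inv)

print(np.allclose(mu_num, mu_cf, atol=1e-3), np.allclose(s2_num, s2_cf, atol=1e-3))
print("D(pi(Q) || Q) =", kl_gauss(mu_cf, s2_cf))
```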

Best Answer

Assuming you want to minimize the Kullback–Leibler divergence $$D(P\parallel Q)=\int dP\,\ln\frac{dP}{dQ}$$ over all isotropic Gaussian $P$, "the" minimizer is in general not unique and, accordingly, the map taking $Q$ to its minimizer is not Lipschitz even on the set of measures $Q$ where the minimizer is unique.

The idea of a counterexample is quite simple: Suppose that $d=1$. Let $Q_h$ be the probability measure with pdf $q_h$ given by the formula $$q_h(x)=c_h\big((1+h)\,f_{1,a}(x)\,1(x>0) +(1-h)\,f_{-1,a}(x)\,1(x<0)\big)$$ for real $x$, where $f_{t,a}$ is the pdf of the normal distribution $N(t,a^2)$, $a>0$ is small enough (the condition $0<a<\sqrt{2/\pi}$ should do), $h$ is a real number very close to $0$, and $c_h(\approx1/2)$ is the normalizing factor.

Since $a$ is rather small, $Q_h$ is somewhat close to the mixture of the rather narrow normal distributions $N(1,a^2)$ and $N(-1,a^2)$ with slightly unequal weights, $c_h\,(1+h)$ and $c_h\,(1-h)$ respectively. So, a minimizer $P_h$ of the Kullback–Leibler divergence $D(P\parallel Q_h)$ in $P$ should be sufficiently close to $N(1,a^2)$ or $N(-1,a^2)$ depending on whether the small perturbation $h$ is $>0$ or $<0$, respectively. Thus, an infinitesimally small change from, say, $h>0$ to $-h<0$ will result in quite a nonnegligible change from $P_h\approx N(1,a^2)$ to $P_{-h}\approx N(-1,a^2)$. (If $h=0$, then there will be two minimizers.)
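
For what it's worth, here is a rough numerical sketch of this argument (the values $a=0.3$ and $h=\pm0.05$, the grids, and the helper names are just illustrative choices, not a proof): a brute-force search over one-dimensional Gaussians $N(t,s^2)$ shows the minimizing mean jumping from about $+1$ to about $-1$ as $h$ crosses $0$.

```python
import numpy as np
from scipy.stats import norm

a = 0.3
x = np.linspace(-4.0, 4.0, 4001)
dx = x[1] - x[0]
tiny = 1e-300  # floor to avoid log(0) from floating-point underflow in the tails

def q_pdf(h):
    """Density q_h from the answer, normalized numerically on the grid."""
    q = (1 + h) * norm.pdf(x, 1, a) * (x >= 0) + (1 - h) * norm.pdf(x, -1, a) * (x < 0)
    return q / (q.sum() * dx)

def project(q):
    """Brute-force minimizer of D(N(t, s^2) || Q) over a coarse (t, s) grid."""
    best = (np.inf, None, None)
    for t in np.linspace(-1.5, 1.5, 61):
        for s in np.linspace(0.1, 1.0, 46):
            p = norm.pdf(x, t, s)
            kl = np.sum(p * (np.log(np.maximum(p, tiny)) - np.log(np.maximum(q, tiny)))) * dx
            if kl < best[0]:
                best = (kl, t, s)
    return best

for h in (0.05, -0.05):
    kl, t, s = project(q_pdf(h))
    print(f"h = {h:+.2f}:  minimizing mean ~ {t:+.2f}, sd ~ {s:.2f},  D ~ {kl:.3f}")
```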

I can write down details later, if you want them.


Responding to the OP's comment about a possible relation of your question to a result by Csiszár: Your question concerns the existence and uniqueness, for any given probability measure (PM) $Q$, of a PM $P_{\mathcal S,Q}\in\mathcal S$ such that $D(P_{\mathcal S,Q}\parallel Q)\le D(P\parallel Q)$ for all $P\in\mathcal S:=\mathcal P$, the set of all isotropic Gaussian PMs.

In contrast, Csiszár's result is that for any given PM $P$ (rather than $Q$) there is a unique PM $\tilde P^{\mathcal S,P}$ such that $D(\tilde P^{\mathcal S,P}\parallel Q)+D(P\parallel\mathcal S)\le D(P\parallel Q)$ for all $Q\in\mathcal S$ (rather than $P\in\mathcal S$), where $D(P\parallel\mathcal S):=\inf_{Q\in\mathcal S}D(P\parallel Q)$. (So, this looks like some kind of Pythagorean inequality.) A corollary to this result by Csiszár is that $D(\tilde P^{\mathcal S,P}\parallel Q)\le D(P\parallel Q)$ for all $Q\in\mathcal S$ (rather than $P\in\mathcal S$).

So, if $D$ were a metric, your question would be about the existence and uniqueness of a PM in $\mathcal S$ closest to $Q$. On the other hand, again if $D$ were a metric, the mentioned corollary of Csiszár's result would say that the segment $\tilde P^{\mathcal S,P}Q$, obtained from the segment $PQ$ by projecting $P$ onto $\mathcal S$, is no longer than $PQ$, for any $Q$. The latter property would be equivalent to your "closest" property if $D$ were a Euclidean metric. But $D$ is not a metric at all. So, Csiszár's result says something different from what your question is about.

(The comparison of your question to Csiszár's result got more complicated than necessary because you interchanged the usual order of the arguments $P$ and $Q$ of $D(P\parallel Q)$, which was also used by Csiszár. So, I have edited your post, and mine, accordingly.)