Show that the $\chi^2$-distance between probability measures $\mu,\nu$ is equal to $\chi^2(\nu,\mu)=\sup_f\left|\int f\:{\rm d}(\nu-\mu)\right|^2$

Tags: chi-squared, measure-theory, probability-theory, signed-measures

Let $(E,\mathcal E)$ be a measurable space, $\mu$ and $\nu$ be probability measures on $(E,\mathcal E)$ and $$\chi^2(\nu,\mu):=\begin{cases}\displaystyle\mu\left|\frac{{\rm d}\nu}{{\rm d}\mu}-1\right|^2=\mu\left|\frac{{\rm d}\nu}{{\rm d}\mu}\right|^2-1&\text{, if }\nu\ll\mu\\\infty&\text{, otherwise}\end{cases}$$ denote the $\chi^2$-distance of $\mu$ and $\nu$.

I want to show that $$\chi^2(\nu,\mu)=\sup_f\left|\int f\:{\rm d}(\nu-\mu)\right|^2,\tag1$$ where the supremum is taken over all bounded $\mathcal E$-measurable $f:E\to\mathbb R$ with $\left\|f\right\|_{L^2(\mu)}\le1$.

The case $\nu\not\ll\mu$ is clear to me. So, assume $\nu\ll\mu$ and let $$\varrho:=\frac{{\rm d}\nu}{{\rm d}\mu}.$$ I think we need to distinguish the cases $\varrho\in L^2(\mu)$ and $\varrho\not\in L^2(\mu)$. If $\varrho\in L^2(\mu)$, then $${\chi^2(\nu,\mu)}^{\frac12}=\left\|\varrho-1\right\|_{L^2(\mu)}=\sup_{\substack{f\in L^2(\mu)\\\left\|f\right\|_{L^2(\mu)}\le1}}|\langle\varrho-1,f\rangle_{L^2(\mu)}|\tag2$$ as this is true for any Hilbert space.

How can we conclude from $(2)$? I guess we need to argue with the density of bounded $\mathcal E$-measurable $f:E\to\mathbb R$ in $L^2(\mu)$.

And how can we show the claim in the case $\varrho\not\in L^2(\mu)$, where we've clearly got $\chi^2(\nu,\mu)=\infty$?

Best Answer

First let's deal with the case $\rho \in L^2(\mu)$. Let $B$ be the unit ball in $L^2(\mu)$ and say $f \in B$ is in $B_b$ if and only if $f$ is additionally bounded.

What remains to show once $(2)$ is established is that $$\sup_{f \in B} | \langle \rho - 1, f \rangle | = \sup_{f \in B_b} | \langle \rho - 1, f \rangle |.$$ Clearly the left hand side is at least as big as the right hand side. Conversely, let $f \in B$. Define $$f_N = \begin{cases} f \qquad & |f| \leq N \\ N \qquad & \text{otherwise} \end{cases}$$ Then for each $N \in \mathbb{N}$, $f_N \in B_b$. Additionally, it's a simple application of the DCT to check that $f_N \to f$ in $L^2(\mu)$ as $N \to \infty$. This implies that $\langle \rho - 1, f_N \rangle \to \langle \rho - 1, f \rangle$ as $N \to \infty$. Hence $|\langle \rho - 1, f \rangle| \leq \sup_{g \in B_b} |\langle \rho - 1, g \rangle|$ which proves the desired equality.
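As a sanity check of $(1)$ in the simplest setting, here is a minimal numerical sketch on a finite space, where every function is bounded and so $B = B_b$. It assumes $\mu$ has full support (so $\nu \ll \mu$ automatically); the helper names `chi2`, `pairing` and `l2_norm` are hypothetical and chosen only for this illustration. It checks that $f^* = (\rho - 1)/\|\rho - 1\|_{L^2(\mu)}$ attains the supremum and that randomly drawn unit-norm $f$ do not exceed it.

```python
import numpy as np

def chi2(nu, mu):
    """chi^2(nu, mu) on a finite space, assuming mu > 0 everywhere."""
    rho = nu / mu  # Radon-Nikodym derivative d(nu)/d(mu)
    return float(np.sum(mu * (rho - 1.0) ** 2))

def pairing(f, nu, mu):
    """Integral of f against the signed measure nu - mu."""
    return float(np.sum(f * (nu - mu)))

def l2_norm(f, mu):
    """Norm of f in L^2(mu)."""
    return float(np.sqrt(np.sum(mu * f ** 2)))

rng = np.random.default_rng(0)
n = 6
mu = rng.dirichlet(np.ones(n))  # random probability vectors (illustrative data)
nu = rng.dirichlet(np.ones(n))

# Candidate maximiser: (rho - 1) normalised in L^2(mu).
g = nu / mu - 1.0
f_star = g / l2_norm(g, mu)
best = pairing(f_star, nu, mu) ** 2

# Random competitors, each rescaled to have L^2(mu)-norm exactly 1.
trials = rng.normal(size=(2000, n))
trials /= np.sqrt((trials ** 2 * mu).sum(axis=1, keepdims=True))
worst_violation = max(pairing(f, nu, mu) ** 2 for f in trials) - chi2(nu, mu)

print(best, chi2(nu, mu), worst_violation)
```

The truncation step in the proof is invisible here because a finite space has no unbounded functions; the check only illustrates the Hilbert-space identity $(2)$.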

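Now we deal with the case $\rho \not \in L^2(\mu)$. Then $\chi^2(\nu, \mu) = \infty$, so we want to show that $\sup_{f \in B_b} |\langle \rho - 1, f \rangle| = \infty$. Since $|\langle 1, f \rangle| \leq \|f\|_{L^2(\mu)} \leq 1$ for every $f \in B_b$, this is equivalent to $$\sup_{f \in B_b} |\langle \rho, f \rangle | = \infty.$$ Instead I will show the contrapositive. So suppose that the supremum above is finite.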

Let $B_b^+ = \{f \in B_b: f \geq 0\}$ and define $B^+$ analogously. Each of the following equalities is not too difficult to check. $$ \sup_{f \in B_b} | \langle \rho, f \rangle | = \sup_{f \in B_b^+} |\langle \rho, f \rangle| = \sup_{f \in B^+} | \langle \rho, f \rangle | = \sup_{f \in B} |\langle \rho, f \rangle | $$ For the second equality, you should use an argument using cut-offs like I did above. The difference is that this time DCT won't work since we don't know a priori that $\rho f \in L^1(\mu)$ for arbitrary $f \in B$. However, we've restricted attention to positive functions so the MCT will do the job.

One slight subtlety is that you need to prove that the integrals appearing in the $4$th supremum are well-defined. To do this, note that the first two equalities, combined with the assumption, imply that for $f \in B$, $|\langle \rho, |f| \rangle | < \infty$ so that $\rho f \in L^1(\mu)$.

It is then a well-known exercise in functional analysis to see that $\sup_{f \in B} |\langle \rho, f \rangle | < \infty$ implies that $\rho \in L^2(\mu)$; for example, see here.
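One possible way to carry out that exercise is sketched here; the truncations $\rho_N$ and the constant $M$ are notation introduced only for this sketch and need not match the linked argument. Suppose $M := \sup_{f \in B} |\langle \rho, f \rangle| < \infty$ and set $\rho_N := \min(\rho, N)$. Each $\rho_N$ is bounded, hence in $L^2(\mu)$, and is not $\mu$-a.e. zero because $\mu(\rho) = 1$. Thus $f_N := \rho_N / \|\rho_N\|_{L^2(\mu)} \in B$ and, since $\rho \geq \rho_N \geq 0$, $$ M \geq \langle \rho, f_N \rangle = \frac{\mu(\rho \rho_N)}{\|\rho_N\|_{L^2(\mu)}} \geq \frac{\mu(\rho_N^2)}{\|\rho_N\|_{L^2(\mu)}} = \|\rho_N\|_{L^2(\mu)}. $$ Letting $N \to \infty$, the MCT gives $\|\rho\|_{L^2(\mu)} = \lim_N \|\rho_N\|_{L^2(\mu)} \leq M < \infty$.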

Related Solutions

Probability Theory – Total Variation Distance of Probability Measures

First, I think that you are missing a factor of $2$ somewhere. The correct result is $$|\nu - \mu| = \frac12 \sup_f \bigg| \int f d(\nu - \mu) \bigg |.$$

The proof I know of this goes through another characterisation of the total variation distance given by Scheffe's lemma.

Scheffe's Lemma: Fix a reference measure $m$ such that there are measurable $g,h: \Omega \to [0,\infty)$ such that $d\mu = g dm$ and $d \nu = h dm$. Then $$|\nu - \mu| = \frac12 \int |g-h| dm.$$

Note that requiring the existence of a reference measure is no real imposition. Radon-Nikodym means you can always just take $m = \frac12 ( \mu + \nu)$.

Given this characterisation, the remaining inequality is straightforward. Indeed, \begin{align} \bigg |\int f d(\nu - \mu) \bigg | =& \bigg|\int f (g-h) dm \bigg | \\ \leq& \int |f| |g-h| dm \\ \leq& \int |g-h| dm = 2 |\nu - \mu| \end{align} for any $f$ with $\|f\|_\infty \leq 1$.

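Here is a direct proof based on your idea to use the Hahn decomposition that doesn't use Scheffe's lemma. Set $\lambda = \nu - \mu$ and let $E^+, E^-$ be a Hahn decomposition for $\lambda$, so that $\lambda \geq 0$ on measurable subsets of $E^+$ and $\lambda \leq 0$ on measurable subsets of $E^-$. Further, let $A^+ = \{f \geq 0\}$ and $A^- = \{f < 0\}$.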

We can decompose $\bigg |\int f d\lambda\bigg|$ as \begin{align} \bigg | \int_{E^+ \cap A^+} f d\lambda + \int_{E^+ \cap A^-} f d\lambda + \int_{E^- \cap A^+} f d\lambda + \int_{E^- \cap A^-} f d\lambda \bigg | \end{align}

The advantage of this decomposition is that we know the signs of each of the terms and so, splitting terms into groups based on their sign, we get \begin{align} \bigg |\int f d\lambda\bigg| \leq& \bigg | \int_{E^+ \cap A^+} f d\lambda + \int_{E^- \cap A^-} f d\lambda \bigg | \\ +& \bigg |\int_{E^+ \cap A^-} f d\lambda + \int_{E^- \cap A^+} f d\lambda \bigg| \\ = &\int_{E^+ \cap A^+} f d\lambda + \int_{E^- \cap A^-} f d\lambda \\- & \bigg(\int_{E^+ \cap A^-} f d\lambda + \int_{E^- \cap A^+} f d\lambda \bigg) \end{align} Now, the worst case for bounding each of these terms occurs when $f$ is $1$ or $-1$ depending on the set we integrate over. For example, $$\int_{E^+ \cap A^+} f d\lambda \leq \int_{E^+ \cap A^+} 1 d\lambda = \lambda(E^+ \cap A^+)$$ Similarly, \begin{align} \int_{E^- \cap A^-} f d\lambda \leq &\int_{E^- \cap A^-} -1 d\lambda = -\lambda(E^- \cap A^-) \\ \int_{E^+ \cap A^-} f d\lambda \geq& \int_{E^+ \cap A^-} -1 d\lambda = -\lambda(E^+ \cap A^-) \\ \int_{E^- \cap A^+} f d\lambda \geq & \int_{E^- \cap A^+} 1 d\lambda = \lambda(E^- \cap A^+) \end{align} Plugging all of these bounds in and using additivity of the measures to combine terms and get rid of the $A$s, we get that $$\bigg|\int f d\lambda \bigg| \leq \lambda(E^+) -\lambda(E^-) \leq 2 |\nu - \mu|$$ as desired.

For completeness, what follows is a proof of the other inequality with the factor of $\frac{1}{2}$ present based on the idea of Hahn-decomposition. Note that for an arbitrary measurable set $A$, $|\lambda(A)| = |\lambda(A^c)|$ since $\lambda(\Omega) = 0$.

Hence $|\lambda(A)| = \frac12 (|\lambda(A)| + |\lambda(A^c)|)$. We can then write \begin{align} |\lambda(A)| \leq& \frac12 [ \lambda(A \cap E^+) - \lambda(A \cap E^-) + \lambda(A^c \cap E^+) - \lambda(A^c \cap E^-)] \\=& \frac12 (\lambda(E^+) - \lambda(E^-)) \\=& \frac12 \bigg|\int 1_{E^+} - 1_{E^-} d \lambda \bigg| \\ \leq& \frac12 \sup_f \bigg |\int f d\lambda \bigg | \end{align} which proves the other inequality.
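The identities above can be checked numerically on a small finite space. The following is a minimal sketch under assumptions of our own: the reference measure $m$ is taken to be counting measure (so the densities $g, h$ are just the probability vectors), and the names `tv_sets`, `tv_scheffe` and `tv_dual` are hypothetical. It compares $\sup_A |\lambda(A)|$ (brute force over all subsets), Scheffe's formula, and $\frac12 \left|\int f \, d\lambda\right|$ at $f = 1_{E^+} - 1_{E^-}$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 5
mu = rng.dirichlet(np.ones(n))  # illustrative random probability vectors
nu = rng.dirichlet(np.ones(n))
lam = nu - mu                   # the signed measure lambda = nu - mu

# Total variation as a supremum of |lambda(A)| over all 2^n events A.
tv_sets = max(
    abs(sum(lam[i] for i in A))
    for r in range(n + 1)
    for A in itertools.combinations(range(n), r)
)

# Scheffe's formula with m = counting measure: (1/2) * sum |g - h|.
tv_scheffe = 0.5 * float(np.abs(lam).sum())

# Dual formula evaluated at f = 1_{E^+} - 1_{E^-}, i.e. the sign of lambda.
f_star = np.sign(lam)
tv_dual = 0.5 * abs(float((f_star * lam).sum()))

print(tv_sets, tv_scheffe, tv_dual)
```

All three printed numbers should agree up to floating-point error, which reflects that $\lambda(E^+) = -\lambda(E^-) = |\nu - \mu|$.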

Show that $\chi^2(\nu\kappa,\mu)\le c\chi^2(\nu,\mu)$ for all probability measures $\nu$ implies $\text{Var}_\mu[\kappa f]\le c\text{Var}_\mu[f]$ for all $f\in L^2(\mu)$

Instead of patching the holes you mentioned in your question, I found it simplest to use a fresh idea: applying the variational characterization of $\chi^2$ directly, to show that inequality $(1)$ yields a more general variant of $(2)$, from which the desired result will quickly follow.

Lemma. Let $\mu$ be a probability measure and let $\kappa$ be a kernel satisfying $\mu\kappa = \mu$. Let $c\geq 0$ be a constant such that $\chi^2(\nu\kappa,\mu)\leq c\chi^2(\nu,\mu)$ for all probability measures $\nu$. Then $$ \bigl[\mu(g\cdot\kappa f) - \mu g \cdot \mu f\bigr]^2 \leq c \textrm{Var}_{\mu}(g) \textrm{Var}_{\mu}(f)\text{ for all }f,g\in L^2(\mu). $$

Proof. First suppose that $f\geq 0$ and $\mu(f)=1$, so that $\nu=f\mu$ is a probability measure. Then $\chi^2(\nu,\mu)=\textrm{Var}_{\mu}(f)$. Rewriting $\chi^2(\nu\kappa,\mu)$ using the variational characterization of $\chi^2$ presented in Lemma 7.3 (ii) of the notes you linked to, we thus obtain that $$ \bigl[\mu(f\cdot \kappa g)-\mu(g)\bigr]^2\leq c\textrm{Var}_{\mu}(f)\mu(g^2), $$ for all bounded $g\in L^2(\mu)$. Replacing $g$ with $g-\mu(g)$ and simplifying leads to $$ \bigl[\mu(f\cdot \kappa g)-\mu(f)\cdot\mu(g)\bigr]^2\leq c\textrm{Var}_{\mu}(f)\textrm{Var}_{\mu}(g), $$ again for all bounded $g\in L^2(\mu)$. Recall our assumptions that $f\geq 0$ and $\mu(f)=1$ that were made in order to derive this inequality. Now observe that both sides of the inequality scale by the same factor when $f$ is multiplied by a positive constant, so we may dispense with the assumption $\mu(f)=1$. Furthermore, both sides are unchanged after adding a constant function $a$ to $f$: indeed, this follows since $\mu(a\cdot \kappa g)=a\mu(g)$ by the hypothesis $\mu\kappa=\mu$. Thus, the condition $f\geq 0$ can be replaced with the weaker condition that $f$ is bounded from below. Finally, an approximation argument allows us to remove the boundedness assumptions from both $f$ and $g$, yielding the claim. $\square$

Taking $g=\kappa f$ in the lemma, and using $\mu(\kappa f)=\mu(f)$, yields $$ \bigl[\textrm{Var}_{\mu}(\kappa f) \bigr]^2 \leq c \textrm{Var}_{\mu}(\kappa f) \textrm{Var}_{\mu}(f). $$ Thus when $\textrm{Var}_{\mu}(\kappa f)>0$ we can divide through to obtain $$ \textrm{Var}_{\mu}(\kappa f) \leq c \textrm{Var}_{\mu}(f) $$ as desired, and in the remaining case $\textrm{Var}_{\mu}(\kappa f)=0$ the inequality holds trivially.
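To see both inequalities side by side, here is a minimal numerical sketch for a finite-state Markov chain. Everything specific in it is an assumption made for illustration: the kernel is a random strictly positive row-stochastic matrix `P`, $\mu$ is its stationary distribution, and the constant `c` is taken to be the squared second-largest singular value of $D^{1/2} P D^{-1/2}$ with $D = \operatorname{diag}(\mu)$. That choice of `c` is our own identification of a valid constant in the finite setting and does not come from the answer above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5

# Random strictly positive row-stochastic matrix (illustrative kernel kappa).
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)

# Stationary distribution mu: left eigenvector of P for the eigenvalue 1.
w, V = np.linalg.eig(P.T)
mu = np.real(V[:, np.argmax(np.real(w))])
mu = mu / mu.sum()

def var(f):
    """Variance of f under mu."""
    m = np.sum(mu * f)
    return float(np.sum(mu * (f - m) ** 2))

def chi2(nu):
    """chi^2(nu, mu) on the finite state space (mu > 0 everywhere)."""
    return float(np.sum((nu - mu) ** 2 / mu))

# Assumed constant: squared second singular value of D^{1/2} P D^{-1/2}.
S = np.sqrt(mu)[:, None] * P / np.sqrt(mu)[None, :]
sigma = np.linalg.svd(S, compute_uv=False)  # sorted in decreasing order
c = float(sigma[1] ** 2)

chi2_ok, var_ok = True, True
for _ in range(1000):
    nu = rng.dirichlet(np.ones(n))  # random probability measure
    f = rng.normal(size=n)          # random test function
    # nu @ P is the measure nu*kappa; P @ f is the function kappa*f.
    chi2_ok = chi2_ok and bool(chi2(nu @ P) <= c * chi2(nu) + 1e-10)
    var_ok = var_ok and bool(var(P @ f) <= c * var(f) + 1e-10)

print(c, chi2_ok, var_ok)
```

Note that `nu @ P` acts on measures from the left while `P @ f` acts on functions from the right, which mirrors the distinction between $\nu\kappa$ and $\kappa f$ in the lemma.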

Below this line are earlier attempts at answering the question, kept so that anyone interested can read through the evolution of this answer. The following was an attempt to complete the solution sketched in the question, but a hole was uncovered in the final step due to possibly non-negative cross terms appearing in the square. Further below this answer is an earlier solution, which misapplied the $\chi^2$ hypothesis using an incorrect identity $(f\mu)\kappa=(\kappa f)\mu$, effectively solving a different problem than the one which was asked.

You have shown above that the result follows from $$ \bigl|\langle \kappa f,g\rangle_{L^2(\mu)}\bigr|\leq \sqrt{c}\|f\|_{L^2(\mu)}\|g\|_{L^2(\mu)}\quad \text{for all }g\in L^2(\mu),\qquad (6) $$ which you proved for all $g\in L^2(\mu)$ such that $g\geq 0$ and $\mu(g)=1$ and for which $(g\mu)\kappa\ll \mu$. In fact, the final condition is unnecessary.

Claim. For all $g\in L^2(\mu)$ such that $g\geq 0$ and $\mu(g)=1$, we have that $(g\mu)\kappa\ll\mu$.

Proof. By $(1)$ we have that $\chi^2\bigl[(g\mu)\kappa,\mu\bigr]\leq c\chi^2(g\mu,\mu)<\infty$. Thus by the definition of $\chi^2$ you have given, it follows that $(g\mu)\kappa\ll \mu$. $\square$

Thus, it remains to show that if $(6)$ holds for all $g\in L^2(\mu)$ such that $g\geq 0$ and $\mu(g)=1$, then it holds for all $g\in L^2(\mu)$. You already pointed out that a simple scaling argument allows us to dispense with the condition $\mu(g)=1$.

To complete the final step, we take an arbitrary $g\in L^2(\mu)$ and decompose it into positive and negative parts as $g=g_+-g_-$ where $g_+=\max(g,0)$ and $g_-=\max(-g,0)$. Note that $g_{\pm}\geq 0$ and $g_+\cdot g_-=0$. Thus, $g^2=g_+^2+g_-^2$ and therefore $$ \mu(g^2)=\mu(g_+^2)+\mu(g_-^2). $$ Observe that $$ \bigl|\langle \kappa f,g\rangle_{L^2(\mu)}\bigr|^2= \bigl|\langle \kappa f,g_+\rangle_{L^2(\mu)}\bigr|^2+\bigl|\langle \kappa f,g_-\rangle_{L^2(\mu)}\bigr|^2-2\langle \kappa f,g_+\rangle_{L^2(\mu)}\langle \kappa f,g_-\rangle_{L^2(\mu)}, $$ so by the non-negative case of $(6)$ (which we have already established) $$ \bigl|\langle \kappa f,g\rangle_{L^2(\mu)}\bigr|^2\leq c\|f\|^2_{L^2(\mu)}\bigl(\|g_+\|^2_{L^2(\mu)}+\|g_-\|^2_{L^2(\mu)}\bigr)=c\|f\|^2_{L^2(\mu)}\|g\|^2_{L^2(\mu)}, $$ as desired.

Initially I posted the following long-winded answer to the first question asked above. However, after some reflection I realized that a simpler and clearer answer was to address the two sticking points in your approach, as I have done above. Keep reading to see my initial long-winded answer...

Before starting the argument, let me note a basic but important property of any Markov transition kernel $\kappa$. Since $\kappa(x,\cdot)$ is a probability measure for all $x$, we have that $\kappa\cdot 1 = 1$, or more generally for any constant function $c\in L^2(\mu)$ we have that $\kappa\cdot c=c$.

To show that $$ \textrm{Var}_{\mu}[\kappa f]\leq c \textrm{Var}_{\mu}[f]\quad \text{for all }f\in L^2(\mu),\qquad (\star) $$ we start by using the following identity to simplify the expression.

Claim. For all $\mu,\kappa,$ and $f$ as above, let $g=f-\mu(f)\in L^2(\mu)$. Then $$\textrm{Var}_{\mu}[\kappa f]=\mu\bigl[(\kappa g)^2\bigr].$$

Proof. By definition, $\textrm{Var}_{\mu}(f)=\mu[(f-\mu f)^2]$. Thus $$ \textrm{Var}_{\mu}[\kappa f]=\mu\bigl[(\kappa f-\mu\kappa f)^2\bigr]. $$ By invariance of $\mu$ under $\kappa,$ we have that $\mu(\kappa f)=\mu f$. Thus, $$ \textrm{Var}_{\mu}[\kappa f]=\mu\bigl[(\kappa f-\mu f)^2\bigr]=\mu\bigl[(\kappa g)^2\bigr],$$ where in the last equality we used that $\kappa [\mu(f)] = \mu (f)$ since $\kappa$ fixes constant functions. $\square $

Applying the claim to the left and right sides of $(\star)$ (using the Markov kernels $\kappa$ and $\textrm{id}$ respectively), we see that $(\star)$ is a consequence of the following inequality: $$ \mu\bigl[(\kappa g)^2\bigr]\leq c \mu\bigl[g^2\bigr]\quad \text{for all }g\in L^2(\mu)\text{ satisfying }\mu(g)=0.\qquad (\star\star) $$

Now let's work from the other direction, starting by expressing the $\chi^2$ condition in terms of the quantities we have been working with above.

Claim. Let $f\in L^2(\mu)$ be a function such that $f\geq 0$ and $\mu(f)=1$. Then $f\mu$ is a probability measure satisfying $$ \chi^2(f\mu,\mu)=\textrm{Var}_{\mu}(f). $$ Proof. Let $\nu=f\mu$. Since $f\geq 0$ we have that $\nu$ is an unsigned measure, and since $\nu(E)=\mu(f)=1$, we have that $\nu$ is in fact a probability measure. By definition of the Radon-Nikodym derivative, we further have that $$ \nu\ll \mu\quad\text{ and }\quad\frac{d\nu}{d\mu}=f. $$ Therefore $$ \chi^2(\nu,\mu):=\mu\bigl[(f-1)^2\bigr]=\mu\bigl[(f-\mu f)^2\bigr]=\textrm{Var}_{\mu}(f), $$ as desired. $\square $

Taking $f\in L^2(\mu)$ with $f\geq 0$ and $\mu(f)=1$, we observe that the function $\kappa f$ satisfies these same conditions as well. Indeed, $\kappa f\geq 0$ since $(\kappa f)(x)$ is the result of integrating $f$ against the measure $\kappa(x,\cdot)$ and is thus non-negative. Moreover, $\mu(\kappa f)=1$ since $\kappa$ preserves $\mu$.

Thus, the claim applies to both $f$ and $\kappa f$, yielding that $$ \chi^2(f\mu,\mu)=\textrm{Var}_{\mu}(f)\quad\text{ and }\quad \chi^2\bigl[(\kappa f)\mu,\mu\bigr]=\textrm{Var}_{\mu}(\kappa f). $$

The hypothesis states that $$ \chi^2(\nu\kappa,\mu)\leq c\chi^2(\nu,\mu), $$ which we will apply to $\nu=f\mu$. Since $\nu\kappa=(\kappa f)\mu$, when we substitute the identities in the previous display into the left and right sides of the given hypothesis we deduce that $$ \textrm{Var}_{\mu}(\kappa f)\leq c \textrm{Var}_{\mu}(f),\qquad \text{for all }f\in L^2(\mu),f\geq 0,\mu(f)=1. $$ Both sides of this inequality are homogeneous in $f$: replacing $f$ by $Cf$ for any constant $C$ scales both the left and right sides by $C^2$. Thus, we obtain the more general inequality $$ \textrm{Var}_{\mu}(\kappa f)\leq c \textrm{Var}_{\mu}(f),\qquad \text{for all }f\in L^2(\mu),f\geq 0.\qquad (\star\star\star) $$

To be clear: Equation $(\star\star\star)$ has been deduced from the hypothesis, whereas equation $(\star\star)$ is what needs to be established in order to obtain the desired result.

Thus the essence of your question boils down (after these preliminary rewritings) to the implication $(\star\star\star)\implies (\star\star)$. We now prove this using a simple truncation argument.

Truncation argument. Fix $g\in L^2(\mu)$ with $\mu(g)=0$. Let $g_n=\max(g,-n)$, which converges in a monotone fashion to the function $g$ as $n\to\infty$. Applying $(\star\star\star)$ to $f_n=g_n+n$, we obtain that $$ \textrm{Var}_{\mu}(\kappa (g_n+n))\leq c \textrm{Var}_{\mu}(g_n+n)=c \textrm{Var}_{\mu}(g_n). $$ Since $\kappa$ fixes constant functions, $\kappa (g_n+n)=\kappa g_n + n$, so the previous display yields $$ \textrm{Var}_{\mu}(\kappa g_n)\leq c\textrm{Var}_{\mu}(g_n). $$ Taking the limit as $n\to\infty$ on both sides and applying the monotone convergence theorem yields $$ \textrm{Var}_{\mu}(\kappa g)\leq c\textrm{Var}_{\mu}(g), $$ giving us the claim $(\star\star)$ since $\mu(g)=\mu(\kappa g)=0$.
