Definitions of the total variation distance

measure-theoryprobabilityprobability distributionsprobability theorytotal-variation

Introduction. I found the following definitions of the total variation distance $d_{TV}$ between two probability distributions (also called probability measures) $P$ and $Q$ on $\mathcal{A}$ (please note that I tried to use a consistent notation in the following definitions!):

\begin{align}
&{\color{blue}{\textbf{"Definition 2.4" on page 84, in Tsybakov (2009)}}} \\&\hspace{10ex} d_{TV}(P,Q)
= \sup_{A \in \mathcal{A}} \left| P(A)-Q(A) \right| = \sup_{A \in \mathcal{A}} \left| \int_{A} (p-q)d\nu \,\right| \\ \\
&{\color{blue}{\textbf{"2.1 Definition" on page 5, in Strasser (1985)}}} \\&\hspace{10ex}d_{TV}(P,Q)
= \left\Vert P-Q\right\Vert = \sup \{\left| P(A)-Q(A) \right| : {A \in \mathcal{A}} \} \\ \\
&{\color{blue}{\textbf{"4.1. Total Variation Distance" on page 47, in Levin&Peres (2017)}}} \\&\hspace{10ex}d_{TV}(P,Q)
= \left\Vert P-Q\right\Vert = \max_{A \subseteq \mathcal{A}} \left| P(A)-Q(A) \right| \\ \\
&{\color{blue}{\textbf{On page 22, in Villani (2008)}}}\\&\hspace{10ex}d_{TV}(P,Q)
= \left\Vert P-Q\right\Vert = 2 \inf \left\{ \mathbb{E} [\mathcal{1}_{X \neq Y}]; \,\text{law}(X)=P, \text{law}(Y)=Q \right\}
\\
\end{align}

Question. Since I got confusions on the variety of definitions of the total variation distance, could you please show/prove/derive one, or more (or all!), of the following equalities? Or suggest some references proving those equalities?

\begin{align}
&{\color{red}{\textbf{First Equality:}}} \qquad
&&\sup_{A \in \mathcal{A}} \left| P(A)-Q(A) \right|
\stackrel{\bf{{\color{red}?}}}{=}
\sup_{A \in \mathcal{A}} \left| \int_{A} (p-q)d\nu \,\right|
\\ \\
&{\color{red}{\textbf{Second Equality:}}} \qquad
&&\left\Vert P-Q\right\Vert
\stackrel{\bf{{\color{red}?}}}{=}
\sup \{\left| P(A)-Q(A) \right| : {A \in \mathcal{A}} \}
\\ \\
&{\color{red}{\textbf{Third Equality:}}} \qquad
&&\left\Vert P-Q\right\Vert
\stackrel{\bf{{\color{red}?}}}{=}
\max_{A \subseteq \mathcal{A}} \left| P(A)-Q(A) \right|
\\ \\
&{\color{red}{\textbf{Fourth Equality:}}} \qquad
&&\left\Vert P-Q\right\Vert
\stackrel{\bf{{\color{red}?}}}{=}
2 \inf \left\{ \mathbb{E} [\mathcal{1}_{X \neq Y}]; \,\text{law}(X)=P, \text{law}(Y)=Q \right\}
\\
\end{align}

References.

  1. Tsybakov (2009)
  2. Strasser (1985)
  3. Levin&Peres (2017)
  4. Villani (2008)

Best Answer

I think it is clear that the second and third definitions are equivalent. In complete generality, the third definition should use a supremum instead of a maximum, but it's possible they were only considering finite probability spaces, in which case the maximum is sufficient.

The first definition only makes sense if both $P$ and $Q$ have densities with respect to a measure $\nu$. Since $\int_A (p-q) \, d\nu = P(A) - Q(A)$, it is clear that this definition is equivalent to the second and third (when the densities exist).


For the fourth definition note that the infimum is over all couplings (joint distribution) $\mathbb{P}$ for $(X, Y)$ such that the marginal distributions are $P$ and $Q$.

Fix a coupling $\mathbb{P}$.

\begin{align} P(A) - Q(A) &= \mathbb{P}(X \in A) - \mathbb{P}(Y \in A) \\ &= \mathbb{P}(X \in A, X = Y) + \mathbb{P}(X \in A, X \ne Y) - P(Y \in A, X = Y) - P(Y \in A, X \ne Y) \\ &= \mathbb{P}(X \in A, X \ne Y) - P(Y \in A, X \ne Y) \\ &\le \mathbb{P}(X \ne Y) = \mathbb{E}_{\mathbb{P}}[1_{X \ne Y}] \end{align} A similar argument shows $Q(A) - P(A) \le P(X \ne Y)$. Thus, for any coupling $\mathbb{P}$, we have $$\sup_{A \in \mathcal{A}} |P(A) - Q(A)| \le \mathbb{E}_{\mathbb{P}}[1_{X \ne Y}].$$

It remains to show that the infimum of the right-hand side is equal to the left-hand side. This can be done by constructing an "optimal" coupling. For finite probability spaces, see Lemma 4.1.13 here, Lemma 1(b) here, or Lemma 2.2 here. For more general spaces, see Theorem 2.12 here.


Response to comments:

Comment 1: I'm not an expert on the history of this, but yes it seems this definition is the most general and requires the least assumptions on the probability space. The others seem to correspond to special cases.

Comment 2: There is a minor issue when you tried to standardize the notation. I think $\mathcal{A}$ should be a sigma algebra on a probability space $\Omega$. Then the $A \in \mathcal{A}$ makes sense for definitions 1 and 2. But for definition 3, it should be either $A \subseteq \Omega$ or $A \in \mathcal{A}$. I am not sure what context Peres is using, but I think this definition only makes sense for finite spaces $\Omega$ (with the power set as the sigma algebra), since if the sigma algebra is infinite, there may not be a maximum. So in short, Definition 2 is the more general definition, and for finite spaces with the power set as the sigma algebra, the supremum over measurable sets can be written as a maximum over all subsets.

Comment 3: Yes, this is also mentioned on Wikipedia.

Comment 4:

  • If $\Omega$ is finite, then any $\sigma$-algebra $\mathcal{A}$ is finite, since the power set is finite.
  • If $\mathcal{A}$ is finite, then $\max_{A \in \mathcal{A}}$ exists and is equivalent to $\sup_{A \in \mathcal{A}}$.
  • $\max_{A \subseteq \mathcal{A}}$ does not make sense. If $\mathcal{A}$ is the power set, then $\max_{A \in \mathcal{A}}$ is equivalent to $\max_{A \subseteq \Omega}$.