I think it is clear that the second and third definitions are equivalent. In complete generality, the third definition should use a supremum instead of a maximum, but it's possible they were only considering finite probability spaces, in which case the maximum is sufficient.
The first definition only makes sense if both $P$ and $Q$ have densities with respect to a measure $\nu$. Since $\int_A (p-q) \, d\nu = P(A) - Q(A)$, it is clear that this definition is equivalent to the second and third (when the densities exist).
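For a quick sanity check on a finite space, here is a minimal Python sketch. It assumes $\nu$ is the counting measure, so the densities $p$ and $q$ are just probability mass functions; the function names and the example vectors are made up for illustration. It compares the brute-force maximum of $|P(A) - Q(A)|$ over all subsets with the value obtained from the single set $A = \{p > q\}$.

```python
from itertools import chain, combinations

def tv_by_sets(p, q):
    """Brute-force max over all subsets A of |P(A) - Q(A)| (finite space only)."""
    points = range(len(p))
    subsets = chain.from_iterable(
        combinations(points, r) for r in range(len(p) + 1)
    )
    return max(abs(sum(p[i] - q[i] for i in A)) for A in subsets)

def tv_by_positive_part(p, q):
    """P(A) - Q(A) for the particular set A = {i : p[i] > q[i]}."""
    return sum(max(pi - qi, 0.0) for pi, qi in zip(p, q))

# Hypothetical example distributions on a three-point space.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

print(tv_by_sets(p, q), tv_by_positive_part(p, q))
```

The two numbers should agree up to floating-point error, which reflects the fact that the maximum is attained at the set where $p > q$.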
For the fourth definition, note that the infimum is over all couplings $\mathbb{P}$ of $(X, Y)$, that is, all joint distributions whose marginal distributions are $P$ and $Q$.
Fix a coupling $\mathbb{P}$.
\begin{align}
P(A) - Q(A)
&= \mathbb{P}(X \in A) - \mathbb{P}(Y \in A)
\\
&= \mathbb{P}(X \in A, X = Y) + \mathbb{P}(X \in A, X \ne Y) - \mathbb{P}(Y \in A, X = Y) - \mathbb{P}(Y \in A, X \ne Y)
\\
&= \mathbb{P}(X \in A, X \ne Y) - \mathbb{P}(Y \in A, X \ne Y)
\\
&\le \mathbb{P}(X \ne Y) = \mathbb{E}_{\mathbb{P}}[1_{X \ne Y}]
\end{align}
A similar argument shows $Q(A) - P(A) \le \mathbb{P}(X \ne Y)$. Thus, for any coupling $\mathbb{P}$, we have
$$\sup_{A \in \mathcal{A}} |P(A) - Q(A)| \le \mathbb{E}_{\mathbb{P}}[1_{X \ne Y}].$$
It remains to show that the infimum of the right-hand side is equal to the left-hand side. This can be done by constructing an "optimal" coupling. For finite probability spaces, see Lemma 4.1.13 here, Lemma 1(b) here, or Lemma 2.2 here. For more general spaces, see Theorem 2.12 here.
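For the finite case, the usual construction can be sketched in a few lines of Python. This is only an illustration under my own assumptions (finite space, mass functions given as lists, a made-up function name), not the exact construction from any of the cited lemmas: put mass $\min(p_i, q_i)$ on the diagonal and spread the leftover mass as a normalized product of the two remainders.

```python
def maximal_coupling(p, q):
    """Return a joint pmf (list of rows) with marginals p and q
    whose off-diagonal mass equals 1 - sum_i min(p[i], q[i])."""
    n = len(p)
    overlap = [min(pi, qi) for pi, qi in zip(p, q)]
    tv = 1.0 - sum(overlap)  # mass that cannot be placed on the diagonal
    joint = [[0.0] * n for _ in range(n)]
    for i in range(n):
        joint[i][i] = overlap[i]
    if tv > 0:
        for i in range(n):
            for j in range(n):
                # Leftover mass of p at i times leftover mass of q at j.
                # This is zero when i == j, since one of the factors vanishes.
                joint[i][j] += (p[i] - overlap[i]) * (q[j] - overlap[j]) / tv
    return joint

# Hypothetical example distributions on a three-point space.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
joint = maximal_coupling(p, q)

n = len(p)
row_sums = [sum(joint[i]) for i in range(n)]
col_sums = [sum(joint[i][j] for i in range(n)) for j in range(n)]
mismatch = sum(joint[i][j] for i in range(n) for j in range(n) if i != j)
print(row_sums, col_sums, mismatch)
```

The row and column sums recover $p$ and $q$, and the off-diagonal mass $\mathbb{P}(X \ne Y)$ equals $\sum_i (p_i - q_i)^+$, which is the left-hand side of the inequality above on a finite space.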
Response to comments:
Comment 1: I'm not an expert on the history of this, but yes, it seems this definition is the most general and requires the fewest assumptions on the probability space. The others seem to correspond to special cases.
Comment 2: There is a minor issue in your attempt to standardize the notation. I think $\mathcal{A}$ should be a $\sigma$-algebra on a sample space $\Omega$. Then $A \in \mathcal{A}$ makes sense for definitions 1 and 2. But for definition 3, it should be either $A \subseteq \Omega$ or $A \in \mathcal{A}$. I am not sure what context Peres is using, but I think this definition only makes sense for finite spaces $\Omega$ (with the power set as the $\sigma$-algebra), since if the $\sigma$-algebra is infinite, there may not be a maximum. In short, Definition 2 is the more general definition, and for finite spaces with the power set as the $\sigma$-algebra, the supremum over measurable sets can be written as a maximum over all subsets.
Comment 3: Yes, this is also mentioned on Wikipedia.
Comment 4:
- If $\Omega$ is finite, then any $\sigma$-algebra $\mathcal{A}$ is finite, since the power set is finite.
- If $\mathcal{A}$ is finite, then $\max_{A \in \mathcal{A}}$ exists and is equivalent to $\sup_{A \in \mathcal{A}}$.
- $\max_{A \subseteq \mathcal{A}}$ does not make sense. If $\mathcal{A}$ is the power set, then $\max_{A \in \mathcal{A}}$ is equivalent to $\max_{A \subseteq \Omega}$.
In stating his definitions, Villani always assumes that all measures are defined on the Borel $\sigma$-algebra of a Polish space. The topology and/or distance do not play any role in the definition of the total variation, since the latter only depends on the measurable structure.
Thus, Villani's definition is simply a particular case of other standard definitions in the literature. It is particular only in that he assumes the $\sigma$-algebra $\mathcal A$ to be the Borel $\sigma$-algebra of a given Polish topological space $X$, and the measurable space which Villani implicitly uses is the space $(X,\mathcal A)$.
This does not affect the definition of total variation in any other way.
Best Answer
I think there is some difference in definition. See the lecture notes Probability in High Dimensions by van Handel. In Example 4.14 the author writes:
$$ ||\mu - \nu||_{TV} = \inf_{M\in\mathcal C(\mu,\nu)}M(X\neq Y) $$
And he then goes on to prove this.
What might be happening is that the two sources use different normalizations of the TV metric.
Indeed, we can prove that, with your definition of TV, the equality $$||\mu - \nu||_{TV} = \sup_A|\mu(A) - \nu(A)| = 2\inf P[X\neq Y]$$
would be inconsistent. For any coupling $P$ of $(X, Y)$ with marginals $\mu$ and $\nu$, note:
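$$\begin{aligned}
\mu(A) - \nu(A) &= P[X \in A] - P[Y \in A] \\
&= P[X \in A, X=Y] + P[X \in A, X\neq Y] - P[Y \in A, X=Y] - P[Y \in A, X\neq Y] \\
&= P[X \in A, X\neq Y] - P[Y \in A, X \neq Y] \leq P[X\neq Y].
\end{aligned}$$
The two terms with $X = Y$ cancel because on that event $X \in A$ if and only if $Y \in A$. The same bound holds with $\mu$ and $\nu$ swapped. Therefore, $$\sup_A|\mu(A) - \nu(A)| \leq P[X\neq Y]$$ for every coupling. Hence, if $\sup_A|\mu(A) - \nu(A)|>0$, then $$2P[X\neq Y] \geq 2\sup_A|\mu(A)-\nu(A)| > \sup_A|\mu(A)-\nu(A)|$$ for every coupling, so $2\inf P[X\neq Y]$ cannot equal $\sup_A|\mu(A) - \nu(A)|$.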
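To see the factor of 2 concretely, here is a small Python sketch for two Bernoulli distributions. The parameters `a`, `b` and the function name are chosen purely for illustration. Every coupling of Bernoulli($a$) and Bernoulli($b$) is determined by $t = P[X = 1, Y = 1]$, which ranges over $[\max(0, a + b - 1), \min(a, b)]$, so one can simply scan over $t$.

```python
def mismatch_probability(a, b, t):
    """P[X != Y] for the coupling of Bernoulli(a) and Bernoulli(b)
    with P[X = 1, Y = 1] = t."""
    # P[X = 1, Y = 0] = a - t and P[X = 0, Y = 1] = b - t.
    return (a - t) + (b - t)

# Hypothetical parameters.
a, b = 0.7, 0.4

# Admissible range of t so that all four cell probabilities are nonnegative.
t_lo, t_hi = max(0.0, a + b - 1.0), min(a, b)
steps = 1000
grid = [t_lo + (t_hi - t_lo) * k / steps for k in range(steps + 1)]

best = min(mismatch_probability(a, b, t) for t in grid)
sup_gap = abs(a - b)  # sup over A of |mu(A) - nu(A)| on the space {0, 1}

print(best, sup_gap, 2 * best)
```

In this example the smallest value of $P[X \neq Y]$ over the grid matches $\sup_A|\mu(A) - \nu(A)| = |a - b|$ up to floating-point error, while $2\inf P[X \neq Y]$ is twice as large. This is consistent with the argument above: the factor 2 belongs to a different normalization of the TV norm.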