Probability Theory – Total Variation Distance and L1 Norm

measure-theory, probability, probability-distributions, probability-theory

Total variation distance is a measure for comparing two probability distributions (assuming these are unit vectors in a finite space, where the basis corresponds to the sample space $\Omega$). I know a distance measure needs to obey the triangle inequality, that orthogonal vectors should have maximum distance, and that identical distributions should have distance $0$; everything else should lie between these two extremes. I completely don't understand why the $L^1$ norm is chosen for measuring the distance between these vectors (probability distributions).
I also want to know why it is defined exactly the way it is: $$TV(P_1,P_2) = \frac{1}{2}\sum_{x \in \Omega} |P_1(x)-P_2(x)|$$

Best Answer

The TV distance measures exactly what you want: the maximal difference between the probabilities that two distributions $P$ and $Q$ assign to an event. It is hence defined as $$TV(P,Q)=\sup_{A\subseteq\Omega}|P(A)-Q(A)|$$
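To make this concrete with a small example of my own (not part of the original question): take $\Omega=\{0,1\}$ with $P=\mathrm{Bernoulli}(p)$ and $Q=\mathrm{Bernoulli}(q)$, say $p\ge q$. The only candidate events are $\emptyset,\{0\},\{1\},\Omega$, and the largest discrepancy is attained at $A=\{1\}$ (or its complement), so $TV(P,Q)=|p-q|$, which is indeed the same as $\frac{1}{2}\big(|p-q|+|(1-p)-(1-q)|\big)$.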

Now, as is shown in Proposition 4.2 here, your last equation $$TV(P,Q)=\frac{1}{2}\|P-Q\|_1$$ is true (in countable probability spaces). I will not redo the proof here; you can get a glimpse of it using the characterizations am_rf24 writes about in his answer. But I can give you some intuition: although the definition of the TV distance closely resembles the definition of the infinity norm on vectors, it is actually subtly different. Note that the TV distance is defined over events, i.e. subsets of $\Omega$, while the infinity norm is taken over elements of $\Omega$. So to conclude: the one norm is rather coincidentally equivalent to the TV distance and is not forcibly chosen as the norm of choice. But because the $L^1$ norm is a norm, we luckily get the metric properties of the TV distance (in particular the triangle inequality) for free.
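If you want to see the equivalence concretely, here is a minimal Python sketch (my own illustration, not from the linked proposition) that brute-forces the supremum over all events on a small finite sample space and compares it to half the $L^1$ norm; the sample space and the distributions `P` and `Q` are made-up examples:

```python
import itertools

# Brute-force check that sup_{A subset of Omega} |P(A) - Q(A)|
# equals (1/2) * ||P - Q||_1 on a small finite sample space.
omega = ["a", "b", "c", "d"]                      # finite sample space (example)
P = {"a": 0.1, "b": 0.4, "c": 0.3, "d": 0.2}      # example distribution P
Q = {"a": 0.3, "b": 0.2, "c": 0.3, "d": 0.2}      # example distribution Q

# Supremum definition: maximize |P(A) - Q(A)| over all events A (subsets of omega).
tv_sup = max(
    abs(sum(P[x] for x in A) - sum(Q[x] for x in A))
    for r in range(len(omega) + 1)
    for A in itertools.combinations(omega, r)
)

# Half of the L1 norm of the difference P - Q.
tv_l1 = 0.5 * sum(abs(P[x] - Q[x]) for x in omega)

print(tv_sup, tv_l1)  # both give 0.2 (up to floating-point error)
```

The maximizing event here is $A=\{x:P(x)>Q(x)\}$, which is exactly the set that appears in the usual proof of the identity.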

(Another remark: the orthogonality criterion you mention is not really a thing for a norm, as you can have norms without having a scalar product ;) )