Let $B = \{p \ge q\}$. Note that
\begin{align*}
\int_\Omega \left|p-q\right|\, d\nu
&= \int_B (p - q) \, d\nu + \int_{\Omega \setminus B} (q- p)\, d\nu\\
&\le 2 \sup_A \left|\int_A (p-q) \, d\nu\right|
\end{align*}
where the last inequality holds because the two integrals equal $\left|\int_B (p-q)\, d\nu\right|$ and $\left|\int_{\Omega \setminus B} (p-q)\, d\nu\right|$, each of which is at most the supremum. For the reverse inequality, note first that
$$ \int_\Omega (p-q) \,d\nu = P(\Omega) - Q(\Omega) = 0 $$
and hence
$$ \int_B (p-q) \, d\nu = \int_{\Omega \setminus B} (q-p) \, d\nu $$
Now for any $A \in \mathscr F$, we have
\begin{align*}
\left|\int_A (p-q)\, d\nu\right| &= \max\left\{\int_A (p-q)\, d\nu, \int_A (q-p)\, d\nu\right\}\\
&\le\max\left\{ \int_{A\cap B} (p-q)\, d\nu, \int_{A \cap (\Omega \setminus B)} (q-p)\, d\nu\right\}\\
&\le \max\left\{ \int_{B} (p-q)\, d\nu, \int_{\Omega \setminus B} (q-p)\, d\nu\right\}\\
&= \int_B (p-q)\, d\nu\\
&= \frac 12 \int_\Omega \left|p-q\right|\,d\nu
\end{align*}
Taking the supremum over $A \in \mathscr F$ gives
$$ \sup_A \left|\int_A (p-q)\, d\nu\right| \le \frac 12 \int_\Omega \left|p-q\right|\, d\nu $$
which is the other needed inequality.
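As a quick sanity check on the identity just proved (my own illustration, not part of the original argument), one can enumerate all events of a small finite space and compare $\sup_A \left|P(A)-Q(A)\right|$ with $\frac 12 \sum_x |p(x)-q(x)|$; a Python sketch:

```python
from itertools import chain, combinations

# Two example probability vectors on Omega = {0, 1, 2, 3} (arbitrary choice)
p = [0.1, 0.4, 0.3, 0.2]
q = [0.25, 0.25, 0.25, 0.25]
omega = range(len(p))

def all_subsets(s):
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# Supremum of |P(A) - Q(A)| over every event A, versus half the L1 distance
tv_sup = max(abs(sum(p[i] - q[i] for i in A)) for A in all_subsets(omega))
half_l1 = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

print(tv_sup, half_l1)  # both equal 0.2 here; the maximizing event is B = {p >= q}
```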
[Too long for a comment, but this will hopefully restore your faith in KL divergence.]
Here's why I like KL-divergence. Let's say you have two probability measures $\mu$ and $\nu$ on some finite set $X$. Someone secretly chooses either $\mu$ or $\nu$. You receive a certain number $T$ of elements of $X$ chosen randomly and independently according to the secret measure. You want to guess the secret measure correctly with high probability. What do you do?
The best "algorithm" to follow would be to observe the $T$ samples $x_1,\dots,x_T$ and choose $\mu$ or $\nu$ based on which one is more likely to have generated these $T$ samples. The probability that $\mu$ generates these samples is $\prod_{j=1}^T \mu(x_j)$, and the probability that $\nu$ generates them is $\prod_{j=1}^T \nu(x_j)$. So, we choose $\mu$ iff $\prod_{j=1}^T \frac{\mu(x_j)}{\nu(x_j)} > 1$, which is the same as $\sum_{j=1}^T \log \frac{\mu(x_j)}{\nu(x_j)} > 0$. If we let $Z : X \to [-\infty,\infty]$ be the random variable defined by $Z(x) = \log \frac{\mu(x)}{\nu(x)}$, then the expected value of $Z$ under $\mu$ is exactly the KL-divergence between $\mu$ and $\nu$. And then of course the sum of $T$ independent copies of $Z$ has expectation $T\cdot KL(\mu,\nu)$.
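A minimal Python sketch of this observation (the measures $\mu,\nu$ below are my own toy example, not part of the answer): $Z(x)=\log\frac{\mu(x)}{\nu(x)}$, and its expectation under $\mu$ is exactly $KL(\mu,\nu)$.

```python
import math

# Toy measures on X = {0, 1, 2}
mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]

# Z(x) = log(mu(x) / nu(x)); E_mu[Z] = sum_x mu(x) log(mu(x)/nu(x)) = KL(mu, nu)
Z = [math.log(m / n) for m, n in zip(mu, nu)]
kl = sum(m * z for m, z in zip(mu, Z))

print(kl)  # positive, since mu != nu
```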
As $T$ increases, by the weak law of large numbers, the average $\frac{1}{T}\sum_{t \le T} Z_t$ converges in probability to $KL(\mu,\nu)$. The fact that $KL(\mu,\nu) > 0$ (if $\mu \not = \nu$) corresponds to the fact that our algorithm will succeed with probability tending to $1$ as $T \to \infty$.
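To see this concretely, here is a hedged simulation sketch (again using the toy $\mu,\nu$ from above): draw $T$ i.i.d. samples from $\mu$, apply the rule $\sum_j \log\frac{\mu(x_j)}{\nu(x_j)} > 0$, and estimate how often the rule correctly picks $\mu$ as $T$ grows.

```python
import math
import random

mu = [0.5, 0.3, 0.2]   # the secret measure in this experiment
nu = [0.2, 0.3, 0.5]
X = range(len(mu))

def rule_picks_mu(T):
    """Draw T samples from mu and return True if the log-likelihood rule chooses mu."""
    xs = random.choices(X, weights=mu, k=T)
    return sum(math.log(mu[x] / nu[x]) for x in xs) > 0

random.seed(0)
for T in (1, 5, 20, 100):
    trials = 2000
    success = sum(rule_picks_mu(T) for _ in range(trials)) / trials
    print(T, success)  # the success probability climbs toward 1 as T grows
```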
The point is that KL-divergence is an important quantity when trying to distinguish between different distributions, based on observed samples.
Best Answer
The TV distance is measuring exactly what you want: the maximal difference between the probabilities that the two distributions $P$ and $Q$ assign to the same event. It is hence defined as $$\mathrm{TV}(P,Q)=\sup_{A\subseteq\Omega}\left|P(A)-Q(A)\right|$$
Now, as is shown in Proposition 4.2 here, your last equation $$\mathrm{TV}(P,Q)=\frac{1}{2}\|P-Q\|_1$$ is true (in countable probability spaces). I will not redo the proof here; you can get a glimpse of it from the characterizations am_rf24 writes about in his answer. But I can give you some intuition: although the definition of the TV distance closely resembles that of the infinity norm on vectors, it is actually subtly different. Note that the TV distance is defined over events, i.e. subsets of $\Omega$, while the infinity norm is taken over elements of $\Omega$. So to conclude: the $1$-norm is rather coincidentally equivalent to the TV distance and was not deliberately chosen as the norm of choice. But because it is a norm, we luckily do not need to show separately that the TV distance is one.
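To make the "subtly different" remark concrete (my own toy example): the largest pointwise difference can be much smaller than the largest difference over events, because an event can accumulate many small pointwise discrepancies.

```python
n = 6
P = [1 / n] * n                                    # uniform on six points
Q = [1 / n + 0.05] * 3 + [1 / n - 0.05] * 3        # perturbed by +/- 0.05

pointwise_max = max(abs(p - q) for p, q in zip(P, Q))   # 0.05 ("infinity-norm" view)
tv = 0.5 * sum(abs(p - q) for p, q in zip(P, Q))        # 0.15, attained by the event {0, 1, 2}

print(pointwise_max, tv)
```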
(Another remark: the orthogonality criterion you mention is not really a thing for a norm, as you can have norms without having a scalar product ;) )