[Math] Exponential tail bounds without the moment generating function

pr.probability

As an exercise, I thought I would try to prove some classical Chernoff bounds without ever using the moment generating function, but then found myself getting stuck in certain places. Before I state my question, let me give some background.

$\mathbf{Background}$:

Consider for example the fact that there exist constants $c,c'$ such that for any $\lambda > 0$
\begin{equation}
(*)\ \mathop{\mathbb{P}}_\sigma\left(\sum_i \sigma_i x_i > \lambda\right) \le c \cdot e^{-c' \lambda^2/\|x\|_2^2}
\end{equation}
where $\sigma_1,\sigma_2,\ldots$ are independent Rademachers. One way to prove (*) for small $\lambda$ is to say
$$
\begin{aligned}
\mathop{\mathbb{P}}_\sigma\left(\sum_i \sigma_i x_i > \lambda\right) &= \mathop{\mathbb{P}}_\sigma\left(e^{t\sum_i \sigma_i x_i} > e^{t\lambda}\right) \le e^{-t\lambda} \cdot \mathop{\mathbb{E}}_\sigma e^{t\sum_i \sigma_i x_i} = e^{-t\lambda}\cdot \prod_i \mathop{\mathbb{E}}_{\sigma_i} e^{t\sigma_i x_i} = e^{-t\lambda}\cdot \prod_i \tfrac 12\left(e^{-tx_i} + e^{tx_i}\right)\\
&\le e^{-t\lambda} \prod_i \left(1 + e^{t|x_i|}t^2x_i^2/2\right) \le e^{-t\lambda} \prod_i e^{e^{t|x_i|}t^2x_i^2/2} \le e^{c\cdot t^2\|x\|_2^2/2 - t\lambda}
\end{aligned}
$$
as long as $t$ (which will depend on $\lambda$) is small enough that each $e^{t|x_i|}$ term is at most $c$. Then one optimizes and sets something like $t \sim \lambda/\|x\|_2^2$. This proof uses the moment generating function (MGF), i.e. it works with $\mathbb{E} e^{tX}$ for our random variable $X = \sum_i \sigma_i x_i$. At various points it also uses analytic properties of the exponential function, e.g. bounds based on Taylor's theorem.
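For concreteness, here is the optimization spelled out (a routine computation, with the constant $c$ from the last display): the exponent $c t^2\|x\|_2^2/2 - t\lambda$ is minimized at $t = \lambda/(c\|x\|_2^2)$, giving
$$
\mathop{\mathbb{P}}_\sigma\left(\sum_i \sigma_i x_i > \lambda\right) \le e^{-\lambda^2/(2c\|x\|_2^2)},
$$
which is valid as long as this choice of $t$ keeps each $e^{t|x_i|}$ at most $c$, i.e. roughly as long as $\lambda \lesssim \|x\|_2^2/\max_i|x_i|$; this is where the restriction to small $\lambda$ comes from.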

A proof that avoids the MGF and Taylor's theorem is to work with moments. Define $\|X\|_p = (\mathbb{E}|X|^p)^{1/p}$ as usual and let $g_1,g_2,\ldots$ be i.i.d. gaussians of mean $0$ and variance $1$. Then
$$
\|\sum_i \sigma_i x_i\|_p = \sqrt{\frac{\pi}2} \|\mathop{\mathbb{E}}_g \sum_i \sigma_i |g_i| x_i\|_p \le \sqrt{\frac{\pi}2} \|\sum_i \sigma_i |g_i| x_i\|_p = \sqrt{\frac{\pi}2} \|\sum_i g_i x_i\|_p \lesssim \|x\|_2\cdot \sqrt{p}
$$
where the first equality uses $\mathbb{E}|g_i| = \sqrt{2/\pi}$, the inequality is Jensen's, and the last step uses $2$-stability of the gaussian together with $\|g_1\|_p \lesssim \sqrt{p}$ (also the $\sqrt{\pi/2}$ can be avoided, but never mind about that). Then we use that
$$
\mathop{\mathbb{P}}_\sigma\left(\sum_i \sigma_i x_i > \lambda\right) \le \mathop{\mathbb{P}}_\sigma\left(|\sum_i \sigma_i x_i|^p > \lambda^p\right) \le \lambda^{-p} \cdot \|\sum_i \sigma_i x_i\|_p^p
$$
and set $p\sim \lambda^2 / \|x\|_2^2$. This proof is of the form I'm looking for: no MGF, no analytic tricks/Taylor's theorem. (It also works for all $\lambda$.)
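To spell out that last choice of $p$ (again a routine computation): writing the moment bound as $\|\sum_i \sigma_i x_i\|_p \le C\|x\|_2\sqrt{p}$ for an absolute constant $C$, the display above gives
$$
\mathop{\mathbb{P}}_\sigma\left(\sum_i \sigma_i x_i > \lambda\right) \le \left(\frac{C\|x\|_2\sqrt{p}}{\lambda}\right)^p,
$$
and taking $p = \lambda^2/(e^2C^2\|x\|_2^2)$ (when this is at least $1$; for smaller $\lambda$ the bound (*) holds trivially after adjusting the constants) makes the base equal to $1/e$, so the right-hand side is $e^{-p} = e^{-c'\lambda^2/\|x\|_2^2}$.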

$\mathbf{My\ question}$:
One formulation of the Chernoff bound is the following. There is some constant $c>0$ such that the following holds. Say $X_1,\ldots,X_n$ are independent and each bounded by $K$ in magnitude almost surely. Let $X = \sum_i X_i$ with $\mu = \mathbb{E} X$ and $\sigma^2 = \mathbb{E}(X - \mathbb{E}X)^2$. Then for all $\lambda > 0$,
$$
(**)\ \mathop{\mathbb{P}}\left(|X – \mu| > \lambda\right) \lesssim \max\left\{e^{-c\lambda^2/\sigma^2}, \left(\frac{\sigma^2}{\lambda K}\right)^{c\lambda/K}\right\} .
$$
One way to prove the above is to use the moment generating function and analytic tricks (see Exercise 3 at http://terrytao.wordpress.com/2010/01/03/254a-notes-1-concentration-of-measure/). My question is: as in the first example, can we prove (**) using only moment methods, avoiding the MGF and Taylor's theorem/analytic tricks? For example, using combinations of Jensen's inequality, symmetrization, the triangle inequality on $\|\cdot\|_p$, and other such tools.

Note that (**) seems to be equivalent to the statement that for all $p \ge 1$,
$$
\|X - \mu\|_p \lesssim \sigma\sqrt{p} + K\frac{p}{\ln(epK^2/\sigma^2)} .
$$
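For what it's worth, here is a rough sketch (not chasing constants) of the easier direction of this equivalence, moments $\Rightarrow$ tails, in the same spirit as the Rademacher example: Markov's inequality gives
$$
\mathbb{P}(|X - \mu| > \lambda) \le \lambda^{-p}\,\|X - \mu\|_p^p,
$$
and when $\sigma \lesssim \lambda \lesssim \sigma^2/K$ the first term dominates, so taking $p \sim \lambda^2/\sigma^2$ recovers the sub-gaussian branch $e^{-c\lambda^2/\sigma^2}$ exactly as above (for $\lambda \lesssim \sigma$ the bound is trivial); for larger $\lambda$, taking $p$ of order $(\lambda/K)\ln(\lambda K/\sigma^2)$ should recover the other branch. The reverse direction follows by computing moments from the tail bound, e.g. via $\mathbb{E}|X-\mu|^p = \int_0^\infty p\lambda^{p-1}\,\mathbb{P}(|X-\mu| > \lambda)\, d\lambda$.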

Best Answer

Ok, I figured out the answer to my own question, though without the $\ln(ep K^2/\sigma^2)$ in the denominator. I'll see whether I can get that later (or maybe someone else sees it?). We can assume without loss of generality that $\mathbb{E} X_i = 0$ for each $i$ (otherwise apply what follows with the random variables replaced by $X_i - \mathbb{E} X_i$). The trick is to use that $\|\langle g,x \rangle\|_p$ is monotonically increasing as a function of $\|x\|_2$ (this follows from 2-stability of the gaussian). Let $\sigma_i$ be i.i.d. Rademacher, let $g_i$ be i.i.d. standard normal, let $X_i'$ be an independent copy of $X_i$, and write $\|X\|_\infty = \max_i |X_i|$. Then
\begin{align}
\left\|\sum_i X_i\right\|_p &= \left\|\sum_i (X_i - \mathbb{E} X_i)\right\|_p\\
&\le \left\|\sum_i (X_i - X_i')\right\|_p\\
&\le 2\left\|\sum_i \sigma_i X_i\right\|_p\\
&= 2\sqrt{\frac{\pi}{2}}\left\|\mathbb{E}_g\sum_i \sigma_i |g_i| X_i\right\|_p\\
&\lesssim \left\|\sum_i \sigma_i |g_i| X_i\right\|_p\\
&= \left\|\sum_i g_i X_i\right\|_p \tag{1}\\
&\lesssim \sqrt{p}\, \Big\|\big(\sum_i X_i^2\big)^{1/2}\Big\|_p\\
&\le \sqrt{p}\, \Big\|\sum_i X_i^2\Big\|_p^{1/2}\\
&= \sqrt{p}\, \Big\|\sum_i X_i^2 - \mathbb{E}\sum_i X_i^2 + \mathbb{E}\sum_i X_i^2\Big\|_p^{1/2}\\
&\le \sigma\sqrt{p} + \sqrt{p}\,\Big\|\sum_i X_i^2 - \mathbb{E}\sum_i X_i^2\Big\|_p^{1/2}\\
&\lesssim \sigma\sqrt{p} + \sqrt{p}\,\Big\|\sum_i g_i X_i^2\Big\|_p^{1/2}\\
&= \sigma\sqrt{p} + \sqrt{p}\left(\mathbb{E}_X\left[\|X\|_\infty^p\,\mathbb{E}_g\Big|\sum_i g_i X_i \frac{X_i}{\|X\|_\infty}\Big|^p\right]\right)^{1/(2p)}\\
&\le \sigma\sqrt{p} + \sqrt{pK}\left(\mathbb{E}_X\mathbb{E}_g\Big|\sum_i g_i X_i \frac{X_i}{\|X\|_\infty}\Big|^p\right)^{1/(2p)}\\
&\lesssim \sigma\sqrt{p} + \sqrt{pK}\left(\mathbb{E}_X\mathbb{E}_g\Big|\sum_i g_i X_i\Big|^p\right)^{1/(2p)} \tag{2}\\
&= \sigma\sqrt{p} + \sqrt{pK}\left\|\sum_i g_i X_i\right\|_p^{1/2} \tag{3}
\end{align}
Here the second line is Jensen's inequality (using $\mathbb{E} X_i' = \mathbb{E} X_i$), the third uses that $\sum_i(X_i - X_i')$ is distributed as $\sum_i \sigma_i(X_i - X_i')$ together with the triangle inequality, the step after (1) applies 2-stability of the gaussian conditionally on $X$ together with $\|g_1\|_p \lesssim \sqrt{p}$, the first $\sigma\sqrt{p}$ appears since $\mathbb{E}\sum_i X_i^2 = \sigma^2$, the following line is symmetrization again as in the first steps, and the $\sqrt{pK}$ step uses $\|X\|_\infty \le K$ almost surely.

Inequality (2) used that $\|\langle g,x\rangle\|_p$ is monotonically increasing as a function of $\|x\|_2$, as mentioned above, together with the fact that $|X_i|/\|X\|_\infty \le 1$ for every $i$. Now notice that (3) bounds the quantity in (1) by a constant times $\sigma\sqrt{p}$ plus $\sqrt{pK}$ times its own square root. Thus $\|\sum_i g_i X_i\|_p^{1/2}$ is at most the larger root of the associated quadratic, which gives the bound $$ \|\sum_i g_i X_i\|_p \lesssim \sigma\sqrt{p} + Kp, $$ and combining this with the chain preceding (1) gives $\|\sum_i X_i\|_p \lesssim \sigma\sqrt{p} + Kp$, as desired.
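(To spell out the quadratic step: writing $u = \|\sum_i g_i X_i\|_p^{1/2}$, the chain from (1) to (3) says $u^2 \le C(\sigma\sqrt{p} + \sqrt{pK}\,u)$ for some absolute constant $C$, so
$$
u \le \frac{C\sqrt{pK} + \sqrt{C^2 pK + 4C\sigma\sqrt{p}}}{2} \le C\sqrt{pK} + \sqrt{C\sigma\sqrt{p}},
$$
and squaring gives $u^2 \lesssim pK + \sigma\sqrt{p}$.)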

I tried to be careful and write all the details to make it clear, but much of the above argument can be compressed once you're familiar with the right bag of tricks (symmetrization, gaussian comparison principle, and triangle inequality in the right places).
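Not part of the proof, but here is a quick Monte Carlo sanity check of the shape of the final bound $\|\sum_i X_i\|_p \lesssim \sigma\sqrt{p} + Kp$. This is an illustrative sketch only: the distribution, sample size, and implied constant $1$ are my own choices, and empirical estimates of high moments are noisy, so it verifies nothing rigorously.

```python
import numpy as np

# Monte Carlo sanity check (not a proof) of the shape ||sum_i X_i||_p <~ sigma*sqrt(p) + K*p.
# Illustrative setup: X_i = Bernoulli(q) - q, so the X_i are independent, mean zero,
# |X_i| <= K = 1, and sum_i X_i is a centered Binomial(n, q) with variance sigma^2 = n*q*(1-q).
rng = np.random.default_rng(0)
n, q, trials, K = 200, 0.05, 200_000, 1.0
sigma = np.sqrt(n * q * (1 - q))
S = rng.binomial(n, q, size=trials) - n * q        # samples of sum_i X_i (float array)

for p in (2, 4, 8, 16, 32):
    emp = np.mean(np.abs(S) ** p) ** (1.0 / p)     # empirical ||sum_i X_i||_p (noisy for large p)
    bound = sigma * np.sqrt(p) + K * p             # claimed shape, with implied constant 1
    print(f"p = {p:2d}:  empirical p-norm = {emp:7.2f}   sigma*sqrt(p) + K*p = {bound:7.2f}")
```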
