Probability – Understanding Proof of Lemma Used in Hoeffding Inequality

Tags: probability, probability-inequalities

I am studying Larry Wasserman's lecture notes on Statistics, which use Casella and Berger as their primary text. I am working through his lecture notes set 2 and got stuck in the derivation of the lemma used in Hoeffding's inequality (pp. 2-3). I reproduce the proof from the notes below, and after the proof I will point out where I am stuck.


Lemma

Suppose that $\mathbb{E}(X) = 0$ and that $ a \le X \le b$. Then
$\mathbb{E}(e^{tX}) \le e^{t^2 (b-a)^2/8}$.

Proof

Since $a \le X \le b$, we can write $X$ as a convex combination of $a$ and $b$, namely
$X = \alpha b + (1 - \alpha) a$ where $\alpha = \frac{X-a}{b-a}$. By convexity of the function $y \to e^{ty}$ we have

$e^{tX} \le \alpha e^{tb} + (1 - \alpha) e^{ta} = \frac{X-a}{b-a} e^{tb} + \frac{b-X}{b-a} e^{ta}$

Take expectations of both sides and use the fact $\mathbb{E}(X) = 0$ to get

$\mathbb{E}(e^{tX}) \le \frac{-a}{b-a} e^{tb} + \frac{b}{b-a} e^{ta} = e^{g(u)}$

where $u = t(b-a)$, $g(u) = -\gamma u + \log(1-\gamma + \gamma e^{u})$ and $\gamma = -a /(b-a)$. Note that $g(0) = g'(0) = 0$. Also $g''(u) \le 1/4$ for all $u > 0$.

By Taylor's theorem, there is a $\varepsilon \in (0, u)$ such that
$g(u) = g(0) + u g'(0) + \frac{u^2}{2} g''(\varepsilon) = \frac{u^2}{2} g''(\varepsilon) \le \frac{u^2}{8} = \frac{t^2(b-a)^2}{8}$

Hence $\mathbb{E}(e^{tX}) \le e^{g(u)} \le e^{\frac{t^2(b-a)^2}{8}}$.
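(For what it's worth, the lemma itself is easy to check numerically. Here is a small sanity check of my own, not part of the notes, using the two-point centered variable that will reappear in the answer below:)

```python
import math

# Numerical sanity check of the lemma (my own, not from Wasserman's notes):
# for the centered two-point variable with P(X=a) = b/(b-a), P(X=b) = -a/(b-a),
# the MGF E(e^{tX}) can be computed exactly and compared with e^{t^2 (b-a)^2 / 8}.
for a, b in [(-1.0, 1.0), (-1.0, 3.0), (-2.0, 0.5)]:
    p_a, p_b = b / (b - a), -a / (b - a)   # weights making E(X) = 0
    assert abs(p_a * a + p_b * b) < 1e-12  # indeed centered
    for t in (0.1, 0.5, 1.0, 2.0):
        mgf = p_a * math.exp(t * a) + p_b * math.exp(t * b)
        bound = math.exp(t**2 * (b - a) ** 2 / 8)
        assert mgf <= bound
```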


I could follow the proof till

$\mathbb{E}(e^{tX}) \le \frac{-a}{b-a} e^{tb} + \frac{b}{b-a} e^{ta} = e^{g(u)}$
but I am unable to figure out how to derive $u, g(u), \gamma$.

Best Answer

I’m not sure I understood your question correctly. I’ll try to answer: try to write $$-\frac{a}{b-a} e^{tb} + \frac{b}{b-a} e^{ta}$$ as a function of $u = t(b-a)$: this is natural, as you want a bound in $e^{u^2 \over 8}$.

With experience, you will know that it is better to write it in the form $e^{g(u)}$. Then $$e^{g(u)} = -\frac{a}{b-a} e^{tb} + \frac{b}{b-a} e^{ta}$$ leads to $$\begin{align*} g(u) &= \log\left( -\frac{a}{b-a} e^{tb} + \frac{b}{b-a} e^{ta} \right)\\ &= \log\left( e^{ta} \left( -\frac{a}{b-a} e^{t(b-a)} + \frac{b}{b-a} \right)\right)\\ &= ta + \log\left( \gamma e^u + (1-\gamma) \right)\\ &= -\gamma u + \log\left( \gamma e^u + (1-\gamma) \right),\\ \end{align*}$$ with $\gamma = - {a \over b-a}$, using $ta = -\gamma u$ in the last step.
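To make the algebra concrete, here is a quick numerical check of this identity (my own sketch; the variable names mirror the formulas above, and the particular values of $a$, $b$, $t$ are arbitrary):

```python
import math

# Verify numerically that -a/(b-a) * e^{tb} + b/(b-a) * e^{ta} equals e^{g(u)}
# with u = t(b-a), gamma = -a/(b-a), g(u) = -gamma*u + log(1 - gamma + gamma*e^u).
a, b, t = -1.0, 2.0, 0.7   # arbitrary values with a < 0 < b (E(X)=0 forces a <= 0 <= b)
gamma = -a / (b - a)
u = t * (b - a)

lhs = -a / (b - a) * math.exp(t * b) + b / (b - a) * math.exp(t * a)
g = -gamma * u + math.log(1 - gamma + gamma * math.exp(u))
rhs = math.exp(g)

assert abs(lhs - rhs) < 1e-12
```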

Is that the kind of thing you were asking for?

Edit: a few comments on the proof

  1. The first trick deserves to be looked at carefully: if $\phi$ is a convex function, and $a\le X\le b$ is a centered random variable, then $$\mathbb{E}(\phi(X)) \le -{a\over b-a} \phi(b) + {b \over b-a} \phi(a) = \mathbb{E}(\phi(X_0)),$$ where $X_0$ is the discrete variable defined by $$\begin{align*} \mathbb P(X_0=a) &= {b \over b-a}\\ \mathbb P(X_0=b) &= -{a \over b-a}.\end{align*}$$ As a consequence, $X_0$ is the centered variable with support in $[a,b]$ that has the highest variance: $$\mathsf{Var} (X) = \mathbb{E}(X^2) \le \mathbb{E}(X_0^2) = {ba^2 - ab^2 \over b-a } = - ab.$$ Note that for a fixed support width $b-a$, this is at most ${(b-a)^2\over 4}$, as Dilip says in the comments, because $(b-a)^2 + 4ab \ge 0$; the bound is attained for $a=-b$.
  2. Now turn to our problem. Why is it possible to get a bound depending only on $u = t(b-a)$? Intuitively, it is just a matter of rescaling $X$: if you have a bound $\mathbb{E}\left( e^{tX} \right) \le s(t)$ for the case $b-a=1$, then the general bound can be obtained by taking $s( t(b-a) )$. Now think of the set of centered variables with support of width 1: there isn’t much freedom left, so a bound like $s(t)$ should exist.

    Another approach is to say simply that, by the above lemma on $\mathbb{E}(\phi(X))$, more generally $\mathbb{E}(\phi(tX)) \le \mathbb{E}(\phi(tX_0))$, which depends only on $u$ and $\gamma$: if you fix $u = u_0 = t_0 (b_0 - a_0)$ and $\gamma = \gamma_0 = - {a_0 \over b_0-a_0}$, and let $t, a, b$ vary, there is only one degree of freedom, and $t = {t_0 \over \alpha}$, $a = \alpha a_0$, $b = \alpha b_0$. We get $$-{a\over b-a} \phi(tb) + {b \over b-a} \phi(ta) = -{a_0\over b_0-a_0} \phi(t_0 b_0) + {b_0 \over b_0-a_0} \phi(t_0 a_0).$$ You just have to find a bound involving only $u$.

  3. Now that we are convinced it can be done, it should be much easier! You don’t necessarily think of $g$ to begin with. The point is that you must write everything as a function of $u$ and $\gamma$.

    First note that $\gamma = -{a\over b-a}$, $1 -\gamma = {b\over b-a}$, $at = -\gamma u$ and $bt = (1-\gamma)u$. Then $$\begin{align*} \mathbb{E}(\phi(tX)) \le &-{a\over b-a} \phi(tb) + {b \over b-a} \phi(ta) \\ = & \gamma \phi((1-\gamma)u) + (1-\gamma) \phi(-\gamma u) \end{align*}$$

    Now we are in the particular case $\phi=\exp$... I think you can finish.
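To tie the three points above together, here is a numerical sketch of my own (not part of the original answer): it checks the variance claim for $X_0$ from point 1, the invariance of $u$ and $\gamma$ under the rescaling from point 2, and, with $\phi = \exp$, the identity $\gamma e^{(1-\gamma)u} + (1-\gamma) e^{-\gamma u} = e^{g(u)}$ together with the bound $g(u) \le u^2/8$ from point 3.

```python
import math

def g(u, gamma):
    # g(u) = -gamma*u + log(1 - gamma + gamma*e^u), as in the proof
    return -gamma * u + math.log(1 - gamma + gamma * math.exp(u))

# Point 1: for X0 with P(X0=a) = b/(b-a), P(X0=b) = -a/(b-a),
# E(X0) = 0 and Var(X0) = -ab <= (b-a)^2 / 4.
a, b = -1.0, 3.0
p_a, p_b = b / (b - a), -a / (b - a)
var = p_a * a**2 + p_b * b**2          # E(X0^2), since E(X0) = 0
assert abs(p_a * a + p_b * b) < 1e-12  # centered
assert abs(var - (-a * b)) < 1e-12
assert var <= (b - a) ** 2 / 4

# Point 2: rescaling t -> t0/alpha, a -> alpha*a0, b -> alpha*b0
# leaves u = t(b-a) and gamma = -a/(b-a) unchanged.
t0, a0, b0 = 0.5, -1.0, 2.0
for alpha in (0.5, 1.0, 3.0):
    t, a2, b2 = t0 / alpha, alpha * a0, alpha * b0
    assert abs(t * (b2 - a2) - t0 * (b0 - a0)) < 1e-12            # u invariant
    assert abs(-a2 / (b2 - a2) - (-a0 / (b0 - a0))) < 1e-12       # gamma invariant

# Point 3: with phi = exp, gamma*e^{(1-gamma)u} + (1-gamma)*e^{-gamma*u} = e^{g(u)},
# and g(u) <= u^2/8.
for gamma in (0.1, 1 / 3, 0.5, 0.9):
    for u in (0.1, 0.5, 1.0, 2.0, 5.0):
        two_point = gamma * math.exp((1 - gamma) * u) + (1 - gamma) * math.exp(-gamma * u)
        assert abs(two_point - math.exp(g(u, gamma))) < 1e-9
        assert g(u, gamma) <= u**2 / 8 + 1e-12
```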

I hope I clarified it a little bit.
