Hoeffding’s inequality for conditional probability

Tags: conditional-expectation, inequality, probability-theory, statistics

I am currently reading the paper *Functional Classification in Hilbert Spaces* by Biau, Bunea and Wegkamp, and there is one step in the proof of Theorem 1 that is not clear to me. I give below a simplified version of the steps I do not understand; the original calculations can be found in Section V-B, page 7, of the paper.

To briefly introduce the notation: we are dealing with a binary classification problem. The dataset is made of $n$ i.i.d. pairs $(X_i,Y_i) \in \mathcal X \times \left\{0,1\right\}$, and we denote by $\hat\phi_n$ a classifier that belongs to a certain hypothesis class and is a function of the data. Lastly, $L$ and $\hat L$ are respectively the true loss and the empirical loss of the classifier $\hat\phi_n$, i.e.
$$L:= \mathbb P\left(\hat\phi_n(X)\ne Y\ |\ (X_i,Y_i),1\le i\le n\right) \quad \text{ and }\quad \hat L :=\frac{1}{n}\sum_{i=1}^n\mathbf 1\left\{\hat\phi_n(X_i)\ne Y_i\right\} $$
The inequality I'm having trouble with is the following:
$$\begin{align}
\mathbb P\left(L-\hat L\ge \varepsilon\right) &= \mathbb E \mathbb P\left(L-\hat L\ge \varepsilon\ |\ (X_i,Y_i),1\le i\le n\right) \\
&\le \exp\left(-2n\varepsilon^2\right)
\end{align}$$

The first line is clearly true by the law of total expectation, and I understand that the second line is a direct application of Hoeffding's inequality since, conditionally on the data, $n\hat L$ is a sum of i.i.d. Bernoulli variables of parameter $L$. My issue is that I don't know how to formalize Hoeffding's inequality in this case, where the probability is conditional on a sigma-algebra and hence becomes a random variable. Should I maybe show that Hoeffding's inequality holds for the probability conditional on any event in $\sigma\left((X_i,Y_i),1\le i \le n\right)$?
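Purely as a numerical sanity check (not part of any proof), here is a short simulation that takes the claim above at face value, i.e. treats $n\hat L$, conditionally on the data, as a $\mathrm{Binomial}(n, L)$ variable; all parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sanity check: conditionally on the data, n * L_hat ~ Binomial(n, L),
# where L is the conditional true loss of the (frozen) classifier.
n = 200             # sample size
L = 0.3             # conditional true loss
eps = 0.05          # deviation threshold
n_trials = 200_000  # Monte Carlo repetitions

L_hat = rng.binomial(n, L, size=n_trials) / n   # simulated empirical losses

lhs = np.mean(L - L_hat >= eps)   # estimate of P(L - L_hat >= eps | data)
rhs = np.exp(-2 * n * eps**2)     # Hoeffding bound exp(-2 n eps^2)

print(f"P(L - L_hat >= eps | data) = {lhs:.4f} <= bound {rhs:.4f}")
# Typically prints something like 0.07 <= 0.37: the bound holds, loosely.
```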

In summary, my question is the following: how can one formally derive Hoeffding's inequality for a probability conditional on a sigma-algebra?

Many thanks in advance for the help.


Update: First off, I noticed I got confused with the notation: $\hat\phi_n$ is, in fact, dependent on the data (in the paper at least). I have fixed the mistake.

Regarding my question, I attempted to prove Hoeffding's inequality for conditional probability by following the proof of the original inequality and adapting it. So let the $X_i$ be independent r.v.'s, respectively bounded in $[a_i,b_i]$, let $S_n = \sum_{i=1}^n X_i$, and let $\mathcal F$ be a sigma-algebra.
By the conditional Markov inequality, we have, for any $s>0$,
$$\begin{align} \mathbb P\left(S_n-\mathbb E S_n \ge \varepsilon \ \vert\ \mathcal F\right)&\le e^{-s\varepsilon}\mathbb E \left[e^{s(S_n-\mathbb E S_n)} \ \vert\ \mathcal F\right]\\
&=e^{-s\varepsilon}\mathbb E \left[\exp\left(s\sum_{i=1}^n \left(X_i - \mathbb E X_i\right)\right) \ \vert\ \mathcal F\right] \\
&= e^{-s\varepsilon} \prod_{i=1}^n\mathbb E \left[e^{s\left( X_i - \mathbb E X_i\right)} \ \vert\ \mathcal F\right] \text{ (by independence)}
\end{align}$$

To conclude, I would then have to prove and apply a conditional version of Hoeffding's lemma, but sadly that doesn't seem to work, since $\mathbb E\left[X_i - \mathbb E[X_i]\ \vert \ \mathcal F\right]\ne 0$ in general. I believe it's still possible to upper-bound this conditional expectation and get a similar but weaker version of the lemma for conditional probability, but that would yield a different final bound (I haven't done the calculation yet).
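To make the obstruction concrete (my own example, purely for illustration): if $\mathcal F = \sigma(X_1,\dots,X_n)$, then $\mathbb E\left[X_i - \mathbb E X_i\ \vert\ \mathcal F\right] = X_i - \mathbb E X_i$, which is a non-degenerate random variable whenever $X_i$ is, so the centering that Hoeffding's lemma requires is genuinely lost.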

Am I on the right track ?

Best Answer

After looking at the problem again, I figured out what was wrong in my "conditional Hoeffding inequality" proof attempt: in the paper's setting, $L$ is not equal to $\mathbb E[\hat L]$ but rather $L = \mathbb E[\hat L\color{red}{\ \vert \mathcal F}]$ (by definition of conditional probability).
Therefore, the "true" conditional Hoeffding inequality I want to prove is in fact (with the same notations) : $$\mathbb P\left(S_n-\mathbb E [S_n\color{red}{\ \vert \mathcal F}] \ge \varepsilon \ \vert \mathcal F\right)\le \exp\left(\frac{-2\varepsilon}{\sum_{i=1}^n(b_i-a_i)^2}\right)$$ If I proceed similarly as in my initial attempt, I have $$\begin{align}\mathbb P\left(S_n-\mathbb E [S_n\color{red}{\ \vert \mathcal F}] \ge \varepsilon \ \vert \mathcal F\right)&\le e^{-s\varepsilon} \prod_{i=1}^n\mathbb E \left[e^{s\left( X_i - \mathbb E [X_i\color{red}{\ \vert \mathcal F}]\right)}\ \vert \mathcal F\right] \end{align} \tag1$$ And all I need to conclude is, once again, a conditional version of Hoeffding's lemma. But thankfully $\mathbb E\left[X_i - \mathbb E[X_i\ \vert \mathcal F]\ \vert \ \mathcal F\right]=0$ and by monotonicity of $\mathbb E[\cdot\ \vert \mathcal F]$ and convexity of $\exp$, one can follow the steps of the original proof to conclude that $$\mathbb E \left[e^{s\left( X_i - \mathbb E [X_i\color{red}{\ \vert \mathcal F}]\right)}\ \vert \mathcal F\right] \le \exp\left(\frac{s^2(b_i-a_i)^2}{8}\right)\ \ \forall\ 1\le i\le n $$

Finally, as in the original proof, plugging this inequality into $(1)$ and letting $s =\frac{4\varepsilon}{\sum_{i=1}^n(b_i-a_i)^2}$ gives the desired result.
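Spelling out that last optimization, since it also confirms the $\varepsilon^2$ in the exponent: plugging the lemma into $(1)$ gives
$$e^{-s\varepsilon}\prod_{i=1}^n \exp\left(\frac{s^2(b_i-a_i)^2}{8}\right) = \exp\left(-s\varepsilon + \frac{s^2}{8}\sum_{i=1}^n(b_i-a_i)^2\right),$$
and the choice $s = \frac{4\varepsilon}{\sum_{i=1}^n(b_i-a_i)^2}$, which minimizes the exponent, yields $\exp\left(\frac{-2\varepsilon^2}{\sum_{i=1}^n(b_i-a_i)^2}\right)$.

Back in the paper's setting, the event $L - \hat L \ge \varepsilon$ is the other tail, $\mathbb E[S_n\ \vert \mathcal F] - S_n \ge n\varepsilon$, so one applies the same argument with the variables negated; with the indicators $\mathbf 1\left\{\hat\phi_n(X_i)\ne Y_i\right\}\in[0,1]$ in the role of the bounded variables, this gives $\exp\left(\frac{-2(n\varepsilon)^2}{n}\right) = \exp\left(-2n\varepsilon^2\right)$, and taking expectations of both sides removes the conditioning, which is exactly the step in the paper.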
