Vapnik-Chervonenkis Inequality – Symmetrization by Ghost-Sample Probabilistic Argument

mathematical-statistics, probability

I am trying to resolve some queries that arose while working through the proof of the Vapnik-Chervonenkis inequality, and would appreciate some assistance in clarifying a particular step. The proof comes from a set of lecture notes by Robert Nowak (2009), closely following the strategy outlined by Devroye, Gyorfi and Lugosi (1996) in their proof of the Glivenko-Cantelli theorem.

I am having trouble with the following line of the proof, extract provided further below:

$$\begin{multline}\mathbb{E}\left[\mathbb{1} \left\{\left \vert \hat{R}_n(\tilde{f}) - R(\tilde{f}) \right \vert > \epsilon \right\} \mathbb{1} \left\{\left \vert \hat{R}'_n(\tilde{f}) - R(\tilde{f}) \right \vert < \frac{\epsilon}{2} \right\} \right]
\\= \mathbb{E}\left[\mathbb{1} \left\{\left \vert \hat{R}_n(\tilde{f}) - R(\tilde{f}) \right \vert > \epsilon \right\} \mathbb{E}\left[\mathbb{1} \left\{\left \vert \hat{R}'_n(\tilde{f}) - R(\tilde{f}) \right \vert < \frac{\epsilon}{2} \right\} \, \middle | \, D_n \right] \right]
\end{multline}$$

1. Why is there a need to condition on the training sample $D_n = \{(X_i, Y_i) \}^n_{i=1}$ in the inner expectation, if it is the case that both the training sample $D_n$ and ghost sample $D_n' = \{(X_i', Y_i') \}^n_{i=1}$ are independently drawn from the same joint distribution, $D_n, D_n' \overset{i.i.d.}\sim P(X, Y)$?

2. If there is no redundancy in conditioning on the training sample $D_n$ due to independence, then what are the underlying distributions in all expectations?

Here is an extract of the proof:

(image of the proof extract not reproduced here)


My guess is that the equality holds by the law of iterated expectations, but I am not sure why this would be necessary given the independence of $D_n$ and $D_n'$. I will edit to include my own working in due course.
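For reference, the identity I have in mind is the tower property together with the rule for taking out what is known: for any bounded random variable $Z$ and any $D_n$-measurable $W$,
$$\mathbb{E}[Z] = \mathbb{E}\big[\mathbb{E}[Z \mid D_n]\big], \qquad \mathbb{E}[W Z] = \mathbb{E}\big[W \, \mathbb{E}[Z \mid D_n]\big],$$
which would seem to apply here with $W = \mathbb{1}\{|\hat{R}_n(\tilde{f}) - R(\tilde{f})| > \epsilon\}$.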

Best Answer

This equality can be made clearer by explicitly denoting the random variables over which each expectation is taken. First, note that the outer expectation is with respect to both $D_n$ and $D_n'$. So, for the sake of clarity, I'll write $$\mathbb{E}_{D_n, D_n'}\left[I\left\{|\hat{R}_n(\tilde{f}) - R(\tilde{f})| > \epsilon \right\}I\left\{|\hat{R}_n'(\tilde{f}) - R(\tilde{f})| < \epsilon/2 \right\}\right].$$
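The manipulation below is an instance of a general fact about independent samples (essentially Fubini's theorem): if $D_n$ and $D_n'$ are independent, then for bounded functions $g$ and $h$,
$$\mathbb{E}_{D_n, D_n'}\big[g(D_n)\, h(D_n, D_n')\big] = \mathbb{E}_{D_n}\Big[g(D_n)\, \mathbb{E}_{D_n'}\big[h(D_n, D_n')\big]\Big],$$
where the inner expectation integrates over $D_n'$ only, holding $D_n$ fixed, and is therefore itself a function of $D_n$. This matters here because $\hat{R}_n'(\tilde{f})$ involves both samples: $D_n'$ through the empirical average and $D_n$ through the selected $\tilde{f}$.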

We can then leverage (1) the independence of $D_n$ and $D_n'$, and (2) the fact that, of the quantities inside the expectation, only $\hat{R}_n'(\tilde{f})$ involves $D_n'$, to get:

$$\begin{align*} \mathbb{E}_{D_n, D_n'}&\left[I\left\{|\hat{R}_n(\tilde{f}) - R(\tilde{f})| > \epsilon \right\}I\left\{|\hat{R}_n'(\tilde{f}) - R(\tilde{f})| < \epsilon/2 \right\}\right] \\ &= \mathbb{E}_{D_n} \mathbb{E}_{D_n'}\left[I\left\{|\hat{R}_n(\tilde{f}) - R(\tilde{f})| > \epsilon \right\}I\left\{|\hat{R}_n'(\tilde{f}) - R(\tilde{f})| < \epsilon/2 \right\}\right]\\ &=\mathbb{E}_{D_n}\left[I\left\{|\hat{R}_n(\tilde{f}) - R(\tilde{f})| > \epsilon \right\}\mathbb{E}_{D_n'}\left[I\left\{|\hat{R}_n'(\tilde{f}) - R(\tilde{f})| < \epsilon/2 \right\}\right]\right]\\ &=\mathbb{E}_{D_n}\left[I\left\{|\hat{R}_n(\tilde{f}) - R(\tilde{f})| > \epsilon \right\}\mathbb{E}_{D_n'}\left[I\left\{|\hat{R}_n'(\tilde{f}) - R(\tilde{f})| < \epsilon/2 \right\}\bigg\vert D_n \right]\right], \end{align*} $$ where the first equality uses the independence of $D_n$ and $D_n'$ to write the joint expectation as an iterated one, the second pulls the first indicator out of the inner expectation because it does not involve $D_n'$, and the last step rewrites $\mathbb{E}_{D_n'}[\,\cdot\,]$ (computed with $D_n$ held fixed) as a conditional expectation, which is valid because $p(D_n' \mid D_n) = p(D_n')$. The conditioning is therefore not redundant: since $\tilde{f}$ is a function of $D_n$, the inner expectation is itself a random variable, namely a function of $D_n$, and independence is exactly what allows it to be evaluated by integrating over $D_n'$ alone.
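If it helps to see the identity numerically, here is a minimal Monte Carlo sketch in Python. Everything in it is an illustrative assumption rather than something taken from the lecture notes: a toy distribution ($X$ uniform on $[0,1]$ with noisy threshold labels), a small class of threshold classifiers, and $\tilde{f}$ chosen as the classifier with the largest gap $|\hat{R}_n(f) - R(f)|$ on $D_n$. The left-hand side is estimated from independent $(D_n, D_n')$ pairs; the right-hand side estimates the inner conditional expectation by re-drawing ghost samples while holding $D_n$ (and hence $\tilde{f}$) fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy setup (not from the notes): X ~ Uniform(0, 1), and
# Y = 1{X > 0.5} with the label flipped with probability `noise`.
# Hypothesis class: threshold classifiers f_t(x) = 1{x > t} on a small grid.
thresholds = np.linspace(0.1, 0.9, 9)
noise, n, eps = 0.2, 50, 0.1

def draw_sample(size):
    x = rng.uniform(size=size)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(size=size) < noise
    return x, np.where(flip, 1 - y, y)

def emp_risks(x, y):
    # Empirical risk of every threshold classifier on the sample (x, y).
    preds = (x[None, :] > thresholds[:, None]).astype(int)
    return (preds != y[None, :]).mean(axis=1)

# True risk has a closed form here: R(f_t) = noise + (1 - 2*noise) * |t - 0.5|.
true_risk = noise + (1 - 2 * noise) * np.abs(thresholds - 0.5)

def select_f_tilde(x, y):
    # Index of the classifier with the largest |empirical - true| risk gap on D_n.
    return np.argmax(np.abs(emp_risks(x, y) - true_risk))

def lhs(trials=20_000):
    # E[ 1{|R_hat(f~) - R(f~)| > eps} * 1{|R_hat'(f~) - R(f~)| < eps/2} ],
    # estimated by drawing independent pairs (D_n, D_n').
    total = 0.0
    for _ in range(trials):
        x, y = draw_sample(n)                           # training sample D_n
        xg, yg = draw_sample(n)                         # ghost sample D_n'
        j = select_f_tilde(x, y)                        # f~ depends only on D_n
        a = abs(emp_risks(x, y)[j] - true_risk[j]) > eps
        b = abs(emp_risks(xg, yg)[j] - true_risk[j]) < eps / 2
        total += float(a and b)
    return total / trials

def rhs(outer=2_000, inner=200):
    # E[ 1{...} * E[ 1{...} | D_n ] ]: the inner conditional expectation is
    # estimated by re-drawing ghost samples with D_n (and hence f~) held fixed.
    total = 0.0
    for _ in range(outer):
        x, y = draw_sample(n)
        j = select_f_tilde(x, y)
        if abs(emp_risks(x, y)[j] - true_risk[j]) <= eps:
            continue                                    # first indicator is 0
        ghosts = [abs(emp_risks(*draw_sample(n))[j] - true_risk[j]) < eps / 2
                  for _ in range(inner)]
        total += np.mean(ghosts)                        # ~ E[1{...} | D_n]
    return total / outer

print(lhs(), rhs())
```

The two printed estimates should agree up to Monte Carlo error, and the nested loop on the right-hand side makes explicit that $\mathbb{E}_{D_n'}[\,\cdot \mid D_n]$ is a function of $D_n$ through $\tilde{f}$.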