Suppose that
- $h \in \mathcal{H}$ is a hypothesis in some class of binary classifiers $\mathcal{H}$,
- $\mathcal{D}_n$ is a training dataset of size $n$,
- $\mathcal{L}$ is the loss function for the binary classification problem defined as
$$
\mathcal{L}(x,h) =
\begin{cases}
1, & s(x) \not= h(x) \\
0, & \text{otherwise}
\end{cases}
$$
where $s(x)$ is the system we are trying to model,
- $R_e(h)$ is the empirical risk of $h$ over a given dataset $\mathcal{D}_n$, defined as
$$
R_e(h) = \frac1n\sum_{i=1}^{n}\mathcal{L}(x_i, h)
$$
- and $R(h)$ is the true risk of the hypothesis $h$.
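To make the definitions concrete, here is a minimal sketch of the 0–1 loss and the empirical risk in Python. The target system `s`, the hypothesis `h`, and the dataset are hypothetical examples, not part of the question:

```python
def zero_one_loss(s, h, x):
    """0-1 loss: 1 when the hypothesis h disagrees with the system s at x."""
    return int(s(x) != h(x))

def empirical_risk(s, h, xs):
    """R_e(h): average 0-1 loss of h over the dataset xs."""
    return sum(zero_one_loss(s, h, x) for x in xs) / len(xs)

# Hypothetical example: s labels x > 0 as positive, h thresholds at 0.5
s = lambda x: int(x > 0)
h = lambda x: int(x > 0.5)
xs = [-1.0, 0.2, 0.7, 1.5]
print(empirical_risk(s, h, xs))  # h errs only at x = 0.2, so R_e(h) = 0.25
```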
How can I show that
$$
\mathbb{E}_{\mathcal{D}_n}\left[R_e(h)\right] = R(h)
$$
where the expectation on the LHS is over all possible training datasets $\mathcal{D}_n$ of size $n$.
What I've tried so far
Since
$$
R_e(h) = \frac1n\sum_{i=1}^{n}\mathcal{L}(X_i, h)
$$
then
\begin{align}
\mathbb{E}_{\mathcal{D}_n}\left[R_e(h)\right] &= \int_{\mathcal{D}_n}{R_e(h)p(\mathcal{D}_n)} \\
&= \frac{1}{n}\int_{\mathcal{D}_n}{\sum_{x_i \in \mathcal{D}_n}\mathcal{L}(x_i, h)p(\mathcal{D}_n)}
\end{align}
I now want to manipulate this to convert it to
$$
R(h) = \int_{x}{\mathcal{L}(x,h)p(x)dx}
$$
I thought of pulling all the $x_i$'s out of the equation above, but I couldn't find a way to bring the $p(x)$ term into the picture, and this is where I am stuck. I am looking for progressive hints that will help me solve this myself.
Best Answer
Suppose the dataset is $\mathcal{D} = \{X_1, \dots, X_n\}$ where each data point $X_i$ is drawn i.i.d. from some distribution $f_X$. The true risk is:
$$R(h) = E_{X \sim f_X}[\mathcal{L}(X, h(X))]$$
Show that $E_{\mathcal{D}_n}[R_e(h)] = R(h)$
$$E_{\mathcal{D}_n}[R_e(h)] \tag{1}$$
$$= E_{\mathcal{D}_n} \left [ \frac{1}{n} \sum_{i=1}^n \mathcal{L}(X_i, h(X_i)) \right ] \tag{2}$$
$$= \frac{1}{n} \sum_{i=1}^n E_{\mathcal{D}_n}[\mathcal{L}(X_i, h(X_i))] \quad \text{(linearity of expectation)} \tag{3}$$
$$= \frac{1}{n} \sum_{i=1}^n E_{X_i}[\mathcal{L}(X_i, h(X_i))] \quad \text{(the $i$th term depends only on $X_i$)} \tag{4}$$
$$= \frac{1}{n} \sum_{i=1}^n E_{X \sim f_X}[\mathcal{L}(X, h(X))] \quad \text{(the $X_i$ are identically distributed)} \tag{5}$$
$$= E_{X \sim f_X}[\mathcal{L}(X, h(X))] \quad \text{(average of $n$ identical terms)} \tag{6}$$
This is equal to the true risk $R(h)$.
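Not part of the proof, but unbiasedness is easy to sanity-check numerically: draw many independent datasets of size $n$, average $R_e(h)$ over them, and compare against $R(h)$ computed directly. The setup below (uniform $f_X$, threshold classifiers `s` and `h`) is a hypothetical example chosen so the true risk is known in closed form:

```python
import random

def zero_one_loss(s, h, x):
    return int(s(x) != h(x))

# Hypothetical setup: X ~ Uniform(0, 1), s labels x > 0.3, h thresholds at 0.5.
# h disagrees with s exactly on the interval (0.3, 0.5], so R(h) = 0.2.
s = lambda x: int(x > 0.3)
h = lambda x: int(x > 0.5)
true_risk = 0.2

random.seed(0)
n, trials = 10, 20000

# Average R_e(h) over many independent datasets D_n of size n
avg_emp_risk = 0.0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    avg_emp_risk += sum(zero_one_loss(s, h, x) for x in xs) / n
avg_emp_risk /= trials

print(abs(avg_emp_risk - true_risk))  # small, and shrinks as trials grows
```

Each individual $R_e(h)$ fluctuates around $0.2$ (its variance scales like $1/n$), but the average over datasets converges to the true risk, which is exactly what the expectation identity says.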
Alternative
Here's an equivalent way of proceeding, starting after step (3) above.
Explicitly write out the expected value over datasets. Because the data points are independent, the joint distribution of the dataset is equal to the product of the marginal distributions of the data points.
$$= \frac{1}{n} \sum_{i=1}^n \int \cdots \int \left ( \prod_{j=1}^n f_X(x_j) \right ) \mathcal{L}(x_i, h(x_i)) \ dx_1 \cdots dx_n$$
Reorder the integrals (see Fubini's theorem) and pull terms involving $x_i$ to the outside:
$$= \frac{1}{n} \sum_{i=1}^n \int f_X(x_i) \mathcal{L}(x_i, h(x_i)) \left [ \int \cdots \int \left ( \prod_{j \ne i} f_X(x_j) \right ) \ dx_1 \cdots dx_{i-1} \ dx_{i+1} \cdots dx_n \right ] dx_i$$
The expression inside the brackets integrates a joint probability density over its entire domain, so it equals one:
$$= \frac{1}{n} \sum_{i=1}^n \int f_X(x_i) \mathcal{L}(x_i, h(x_i)) dx_i$$
The integral is the expected value of $\mathcal{L}(\cdots)$ with respect to $f_X$:
$$= \frac{1}{n} \sum_{i=1}^n E_{X \sim f_X}[\mathcal{L}(X, h(X))]$$
This is the same as the result of step (5) above, so proceed to (6).