Solved – How to show that the average empirical risk is equal to the true risk for a binary classifier

Tags: classification, expected value, risk

Suppose that

  • $h \in \mathcal{H}$ is a hypothesis in some class of binary classifiers $\mathcal{H}$,
  • $\mathcal{D}_n$ is a training dataset of size $n$,
  • $\mathcal{L}$ is the loss function for the binary classification problem defined as
    $$
    \mathcal{L}(x,h) =
    \begin{cases}
    1, & s(x) \not= h(x) \\
    0, & \text{otherwise}
    \end{cases}
    $$

    where $s(x)$ is the system we are trying to model,
  • $R_e(h)$ is the empirical risk of $h$ over a given dataset $\mathcal{D}_n$ defined as
    $$
    R_e(h) = \frac1n\sum_{i=1}^{n}\mathcal{L}(x_i, h)
    $$
  • and $R(h)$ is the true risk of the hypothesis $h$.

How can I show that
$$
\mathbb{E}_{\mathcal{D}_n}\left[R_e(h)\right] = R(h)
$$

where the expectation on the LHS is over all possible training datasets $\mathcal{D}_n$ of size $n$.

What I've tried so far

Since the data points $X_1, \dots, X_n$ in $\mathcal{D}_n$ are random draws,
$$
R_e(h) = \frac1n\sum_{i=1}^{n}\mathcal{L}(X_i, h)
$$

then
\begin{align}
\mathbb{E}_{\mathcal{D}_n}\left[R_e(h)\right] &= \int_{\mathcal{D}_n}{R_e(h)\,p(\mathcal{D}_n)\,d\mathcal{D}_n} \\
&= \frac{1}{n}\int_{\mathcal{D}_n}{\sum_{x_i \in \mathcal{D}_n}\mathcal{L}(x_i, h)\,p(\mathcal{D}_n)\,d\mathcal{D}_n}
\end{align}

I now want to manipulate this to convert it to
$$
R(h) = \int_{x}{\mathcal{L}(x,h)p(x)dx}
$$

I thought of pulling the individual $x_i$'s out of the sum, but I couldn't find a way to bring the $p(x)$ term into the picture, and this is where I am stuck. I am looking for progressive hints that will help me solve this myself. (A small numerical sketch of how I'm picturing the quantities is below.)
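To make the setup concrete, here is a minimal sketch in Python. The discrete distribution $p$, the target $s$, and the hypothesis $h$ are made up purely for illustration; the point is just to instantiate the 0-1 loss, the empirical risk $R_e(h)$ on one sampled dataset, and the true risk $R(h)$ computed exactly from $p$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete input space with a made-up distribution p(x) (illustrative only).
xs = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.4, 0.3, 0.2])

s = lambda x: (x >= 2).astype(int)   # the "system" being modeled (hypothetical)
h = lambda x: (x >= 1).astype(int)   # a fixed hypothesis (hypothetical)

def loss(x):
    """0-1 loss: 1 where h disagrees with s, else 0."""
    return (s(x) != h(x)).astype(float)

def empirical_risk(dataset):
    """R_e(h): average 0-1 loss over one dataset."""
    return loss(dataset).mean()

def true_risk():
    """R(h): expected 0-1 loss under p, computed exactly."""
    return np.sum(p * loss(xs))

D_n = rng.choice(xs, size=50, p=p)   # one training dataset of size n = 50
print(empirical_risk(D_n), true_risk())
```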

Best Answer

Suppose the dataset is $\mathcal{D} = \{X_1, \dots, X_n\}$ where each data point $X_i$ is drawn i.i.d. from some distribution $f_X$. The true risk is:

$$R(h) = E_{X \sim f_X}[\mathcal{L}(X, h(X))]$$

Show that $E_{\mathcal{D}_n}[R_e(h)] = R(h)$

  1. Start with the LHS:

$$E_{\mathcal{D}_n}[R_e(h)]$$

  2. Plug in the expression for the empirical risk $R_e(h)$:

$$= E_{\mathcal{D}_n} \left [ \frac{1}{n} \sum_{i=1}^n \mathcal{L}(X_i, h(X_i)) \right ]$$

  3. By linearity of expectation:

$$= \frac{1}{n} \sum_{i=1}^n E_{\mathcal{D}_n}[\mathcal{L}(X_i, h(X_i))]$$

  4. Because $\mathcal{L}(X_i, h(X_i))$ only depends on $X_i$, the joint expectation (over datasets) is equal to the marginal expectation (over data point $X_i$):

$$= \frac{1}{n} \sum_{i=1}^n E_{X_i}[\mathcal{L}(X_i, h(X_i))]$$

  5. The expected value is the same for all $X_i$ because they're identically distributed. So, we can replace $X_i$ with a generic variable $X$ drawn from the same distribution $f_X$:

$$= \frac{1}{n} \sum_{i=1}^n E_{X \sim f_X}[\mathcal{L}(X, h(X))]$$

  6. Simplify:

$$= E_{X \sim f_X}[\mathcal{L}(X, h(X))]$$

This is equal to the true risk $R(h)$.
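If you want a numerical sanity check of this identity, here is a small Monte Carlo sketch (the discrete distribution, target $s$, and hypothesis $h$ are made up for illustration): averaging $R_e(h)$ over many independently drawn datasets of size $n$ should match $R(h)$ up to simulation noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy discrete distribution, target s, and hypothesis h (all illustrative).
xs = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.4, 0.3, 0.2])
s = lambda x: (x >= 2).astype(int)
h = lambda x: (x >= 1).astype(int)
loss = lambda x: (s(x) != h(x)).astype(float)

n, n_datasets = 25, 100_000

# Draw many datasets D_n, compute R_e(h) for each, then average over datasets.
datasets = rng.choice(xs, size=(n_datasets, n), p=p)
avg_empirical_risk = loss(datasets).mean(axis=1).mean()

true_risk = np.sum(p * loss(xs))          # R(h) = E_X[L(X, h(X))]
print(avg_empirical_risk, true_risk)      # both should be close (0.4 here)
```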


Alternative

Here's an equivalent way of proceeding, starting after step (3) above.

Explicitly write out the expected value over datasets. Because the data points are independent, the joint distribution of the dataset is equal to the product of the marginal distributions of the data points.

$$= \frac{1}{n} \sum_{i=1}^n \int \cdots \int \left ( \prod_{j=1}^n f_X(x_j) \right ) \mathcal{L}(x_i, h(x_i)) \ dx_1 \cdots dx_n$$

Reorder the integrals (see Fubini's theorem) and pull terms involving $x_i$ to the outside:

$$= \frac{1}{n} \sum_{i=1}^n \int f_X(x_i) \mathcal{L}(x_i, h(x_i)) \left [ \int \cdots \int \left ( \prod_{j \ne i} f_X(x_j) \right ) \ dx_1 \cdots dx_{i-1} \ dx_{i+1} \cdots dx_n \right ] dx_i$$

The expression inside the brackets is just the integral of a joint density (the product of the remaining marginals) over its whole domain, so it equals one:

$$= \frac{1}{n} \sum_{i=1}^n \int f_X(x_i) \mathcal{L}(x_i, h(x_i)) dx_i$$

The integral is the expected value of $\mathcal{L}(\cdots)$ with respect to $f_X$:

$$= \frac{1}{n} \sum_{i=1}^n E_{X \sim f_X}[\mathcal{L}(X, h(X))]$$

This is the same as the result of step (5) above, so proceed to (6).
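For a small discrete toy example you can even carry out this "sum over the joint" exactly rather than by simulation. The sketch below (same made-up distribution, target, and hypothesis as before) enumerates every possible dataset of size $n$, weights each by its product-of-marginals probability $p(\mathcal{D}_n)$, and confirms that $\sum_{\mathcal{D}_n} p(\mathcal{D}_n)\,R_e(h)$ equals $R(h)$.

```python
import itertools
import numpy as np

# Same toy discrete setup (illustrative): enumerate every dataset of size n
# and compute E_{D_n}[R_e(h)] exactly as a sum over the product distribution.
xs = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.4, 0.3, 0.2])
s = lambda x: (x >= 2).astype(int)
h = lambda x: (x >= 1).astype(int)
loss = lambda x: (s(x) != h(x)).astype(float)

n = 3  # small n so the sum over all 4**3 datasets stays tiny

lhs = 0.0
for dataset in itertools.product(range(len(xs)), repeat=n):
    idx = np.array(dataset)
    prob = np.prod(p[idx])               # product of marginals (i.i.d. assumption)
    lhs += prob * loss(xs[idx]).mean()   # p(D_n) * R_e(h)

rhs = np.sum(p * loss(xs))               # true risk R(h)
print(lhs, rhs)                          # equal up to floating point (0.4 here)
```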
