[Math] What does this expected value notation mean

definition, machine learning, notation, probability, random variables

From: Learning from Data, section 2.3.1 – Bias and Variance:

Let $f : X \rightarrow Y$, and let $D=\{(x,y=f(x)) : x \in A \subseteq X\}$ where each $x \in A$ is chosen independently with distribution $P(X)$. Assume we've chosen some function $g^D : X \rightarrow Y$ to approximate $f$ on $D$ with some error function.

Define $E_{\text{out}}(g^{D}) = \Bbb E_x[(g^{D}(x)-f(x))^2]$.

$(1.)$ What does $\Bbb E_x$ mean?

I understand the definition of the expected value of a (discrete) random variable $$E[R] = \sum_{r \in R} r\,P(R=r)$$ or of a random vector $\bar R$ of length $N$ $$E[\bar R] = \Big[\sum_{r_j \in R_k} r_j\,P(R_k = r_j)\Big]_{k=0, 1, \dots, N-1}$$

But what does the notation $\Bbb E_x$ mean in this context?

$(2.)$ What does $\Bbb E_D[E_{\text{out}}(g^{D})]$ mean?

The book says it's the "expectation with respect to all data sets", but what does that mean? Expectations are operators on random variables. What is the random variable here? And how would I use proper notation to describe this?

As in $\Bbb E_D[E_{out}(g^{D})] = \sum_{?}?P(?)$

Best Answer

$(1.)$ What does $\Bbb E_x$ mean?

This means the expectation of the quantity in the brackets, with respect to $x \in X$ drawn from the probability distribution $P(X)$. I.e., as an integral:

$$ \int_X (g^D(x) - f(x))^2 p(x) dx $$

where $p(x)$ is the density function of the distribution $P(X)$. This is the quantity often estimated from a sample by the in-sample error of $g^D$:

$$ \frac{1}{n} \sum_i (g^D(x_i) - y_i)^2 $$
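To make this concrete, here is a minimal numerical sketch (not from the book): it assumes $P(X)$ is Uniform$(-1, 1)$, takes $f(x) = \sin(\pi x)$ as a made-up target, and uses a fixed line as a stand-in for a learned $g^D$. It approximates $\Bbb E_x[(g^D(x) - f(x))^2]$ by averaging the squared error over many draws of $x$ from $P(X)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical choices, only to make E_x concrete:
# P(X) = Uniform(-1, 1), target f(x) = sin(pi * x), and g_D is a
# fixed line standing in for a hypothesis learned from some data set D.
def f(x):
    return np.sin(np.pi * x)

def g_D(x):
    return 0.8 * x

# E_out(g_D) = E_x[(g_D(x) - f(x))^2]: average the squared error
# over many x drawn from P(X), a Monte Carlo version of the integral above.
x = rng.uniform(-1.0, 1.0, size=100_000)
print(np.mean((g_D(x) - f(x)) ** 2))
```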

$(2.)$ What does $\Bbb E_D[E_{\text{out}}(g^{D})]$ mean?

The data set $D$ is random here. That is, we treat the data set we use to train our predictive model as random, and are averaging over all the possible training data sets according to their distribution.
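Concretely, the random variable is $E_{\text{out}}(g^{D})$, which is a function of the random training set $D$. If $D$ consists of $N$ points drawn i.i.d. from a discrete $P(X)$ (with $y = f(x)$ determined by $x$), then

$$\Bbb E_D[E_{\text{out}}(g^{D})] = \sum_{D} P(D)\, E_{\text{out}}(g^{D}), \qquad P(D) = \prod_{i=1}^{N} P(X = x_i),$$

where the sum runs over all possible data sets of size $N$; for a continuous $P(X)$ the sum becomes an integral over $X^N$ against $p(x_1)\cdots p(x_N)$.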

$E_{\text{out}}$ is itself an average over fresh points drawn from $P(X)$, so there are really two random data sets being averaged over independently in this calculation:

  • $D$, the training data set, used to construct $g^D$.
  • An unnamed one averaged over in $E_{out}$, the testing data set.

That notation indicates the average test error, averaged across all possible training data sets. This is the quantity that cross-validation estimates.
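A rough simulation of that double average, again with made-up choices (the same uniform $P(X)$ and sinusoidal $f$ as in the earlier sketch, and a least-squares line as the learner), draws many training sets $D$, fits $g^D$ on each, estimates its $E_{\text{out}}$ on fresh test points, and then averages over the training sets:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(np.pi * x)          # hypothetical target function

def sample_data(n):
    """Draw a data set of n points with x ~ Uniform(-1, 1) and y = f(x)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, f(x)

def fit(x, y):
    """Stand-in learner: least-squares line g_D(t) = a*t + b."""
    a, b = np.polyfit(x, y, deg=1)
    return lambda t: a * t + b

n_train, n_test, n_repeats = 10, 5_000, 2_000
e_out = []
for _ in range(n_repeats):
    x_tr, y_tr = sample_data(n_train)                  # random training set D
    g_D = fit(x_tr, y_tr)                              # g^D depends on D
    x_te = rng.uniform(-1.0, 1.0, size=n_test)         # fresh test points
    e_out.append(np.mean((g_D(x_te) - f(x_te)) ** 2))  # inner average: E_x

print(np.mean(e_out))                                  # outer average: E_D
```

The inner mean plays the role of $E_{\text{out}}(g^{D})$ and the outer mean approximates $\Bbb E_D[\,\cdot\,]$.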
