Machine Learning – Why Test Error Matters More Than Expected Test Error

conditional-expectation, expected-value, loss-functions, machine-learning

In Section 7.2 of Hastie, Tibshirani, and Friedman (2013), The Elements of Statistical Learning, we have the target variable $Y$ and a prediction model $\hat{f}(X)$ that has been estimated from a training set $\mathcal{T} = \{Y_1, \dots, Y_N, X_1, \dots, X_N\}$. The loss is denoted $L(Y, \hat{f}(X))$, and the authors then define the test error:
\begin{equation}
\mathrm{Err}_{\mathcal{T}} = \mathbb{E} \left[ L(Y, \hat{f}(X)) | \mathcal{T} \right] ,
\end{equation}

and the expected test error:
\begin{equation}
\mathrm{Err} = \mathbb{E} (\mathrm{Err}_{\mathcal{T}}) .
\end{equation}
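To make the distinction concrete, here is a minimal Monte Carlo sketch of the two quantities, assuming a toy linear model with squared-error loss. The data-generating process, `draw_data`, `err_T`, the sample sizes, and the seed are my own illustrative choices, not anything from the book: $\mathrm{Err}_{\mathcal{T}}$ is approximated by the error of one fitted model on a large fresh test sample, while $\mathrm{Err}$ is approximated by averaging that quantity over many independently drawn training sets.

```python
# Sketch: Err_T (conditional on one training set) vs. Err (averaged over training sets).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
N, N_TEST, N_TRAIN_SETS = 50, 100_000, 200  # illustrative sizes, not from the book

def draw_data(n):
    """Draw (X, Y) from an assumed true model Y = 2X + Gaussian noise."""
    X = rng.uniform(-1, 1, size=(n, 1))
    y = 2 * X[:, 0] + rng.normal(scale=0.5, size=n)
    return X, y

# A large independent sample stands in for the expectation over new (X, Y).
X_test, y_test = draw_data(N_TEST)

def err_T(X_train, y_train):
    """Approximate Err_T: expected squared-error loss conditional on this training set."""
    f_hat = LinearRegression().fit(X_train, y_train)
    return np.mean((y_test - f_hat.predict(X_test)) ** 2)

# Err_T for the single training set we happen to have in hand.
X_tr, y_tr = draw_data(N)
print("Err_T for this training set:", err_T(X_tr, y_tr))

# Err: the average of Err_T over many independently drawn training sets.
errs = [err_T(*draw_data(N)) for _ in range(N_TRAIN_SETS)]
print("Err (average over training sets):", np.mean(errs))
```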

The authors then state:

Estimation of $\mathrm{Err}_{\mathcal{T}}$ will be our goal…

My question: Why do we care more about $\mathrm{Err}_{\mathcal{T}}$ than $\mathrm{Err}$?

I would have thought that the quantity that measures expected loss, regardless of the training sample used, would be more interesting than the expected loss that conditions on one specific training sample. What am I missing here?

Also, I've read this answer here which (based on my possibly incorrect reading) seems to agree with me that $\mathrm{Err}$ is the quantity of interest, but suggests that we often talk about $\mathrm{Err}_{\mathcal{T}}$ because it can be estimated by cross-validation. But this seems to contradict Section 7.12 of the textbook, which (again by my possibly incorrect reading) seems to suggest that cross-validation provides a better estimate of $\mathrm{Err}$ than $\mathrm{Err}_{\mathcal{T}}$.
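For reference, here is a hedged sketch of the cross-validation estimate discussed above. The dataset, model, and fold count are illustrative assumptions rather than the book's example; the point is only that the CV estimate is computed from the one dataset in hand, even though (per Section 7.12) it tends to track $\mathrm{Err}$ more closely than $\mathrm{Err}_{\mathcal{T}}$.

```python
# Sketch: a 10-fold cross-validation error estimate from a single dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 1))            # illustrative synthetic data
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=50)

fold_errs = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=1).split(X):
    f_hat = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_errs.append(np.mean((y[test_idx] - f_hat.predict(X[test_idx])) ** 2))

# Nominally an estimate of Err_T for this dataset, but empirically it behaves
# more like an estimate of Err (cf. ESL Section 7.12).
print("10-fold CV error estimate:", np.mean(fold_errs))
```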

I'm going around in circles on this one so thought I would ask here.

Best Answer

Why do we care more about $\operatorname{Err}_{\mathcal{T}}$ than $\operatorname{Err}$?

I can only guess, but I think it is a reasonable guess.

The former concerns the error for the training set we have right now. It answers "If I were to use this dataset to train this model, what kind of error would I expect?". It is easy to think of the type of people who would want to know this quantity (e.g. data scientists, applied statisticians, basically anyone using a model as a means to an end). These people don't care about the properties of the model across new training sets per se; they only care about how the model they made will perform.

Contrast this with the latter error, which is the expectation of the former error across all training sets. It answers "Were I to collect an infinite sequence of new training sets, and were I to compute $\operatorname{Err}_{\mathcal{T}}$ for each of them, what would be the average value of that sequence of errors?". It is easy to think of the type of people who care about this quantity (e.g. researchers, theorists, etc.). These people are not concerned with any one instance of a model (in contrast to the people in the previous paragraph); they are interested in the general behavior of a model.

So why the former and not the latter? The book is largely concerned with how to fit and validate models when readers have a single dataset in hand and want to know how that model may perform on new data.
