Solved – The meaning of conditional test error vs. expected test error in cross-validation

cross-validation, definition

My textbook on cross-validation is The Elements of Statistical Learning by Hastie et al. (2nd ed.). In sections 7.10.1 and 7.12, they talk about the difference between conditional test error $$E_{(X^*,Y^*)}\left[L(Y^*, \hat{f}(X^*)) \mid \tau\right]$$ and expected test error $$E_\tau \left[ E_{(X^*,Y^*)}\left[L(Y^*, \hat{f}(X^*)) \mid \tau\right] \right].$$
Here $\tau$ is the training data set, $(X^*, Y^*)$ is a new observation drawn independently of $\tau$, $L$ is the loss function, $\hat{f}$ is the model trained on $\tau$, and $E$ denotes expectation.
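To fix ideas, here's a minimal Monte Carlo sketch of the conditional test error in a toy setup I made up (a sine-plus-noise data generator and a cubic least-squares fit standing in for $\hat{f}$): fix one training set $\tau$, then average the loss over fresh test draws.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw(n):
    """Draw n (x, y) pairs from a made-up data-generating process."""
    x = rng.uniform(-2.0, 2.0, n)
    y = np.sin(x) + rng.normal(0.0, 0.3, n)  # true signal sin(x), noise sd 0.3
    return x, y

# One particular training set tau, and the model f_hat fitted to it.
x_tr, y_tr = draw(30)
coefs = np.polyfit(x_tr, y_tr, deg=3)  # cubic least-squares fit = f_hat

# Conditional test error: hold tau (and hence f_hat) fixed and average the
# squared loss over a large sample of fresh (X*, Y*) draws.
x_te, y_te = draw(100_000)
cond_err = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
print(f"conditional test error given this tau: {cond_err:.4f}")
```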

They explain that CV estimates only the expected test error well.

My question is: is there any reason why we would care about the conditional test error?

The only reason I could think of is that we want to answer the question: 'If God puts $n$ data sets on the table, but only lets us take one home to fit our model, which one should we choose?'

Best Answer

I think you may be misunderstanding conditional test error. This may be because Hastie, Tibshirani, and Friedman (HTF) are not consistent in their terminology, sometimes calling this same notion "test error", "generalization error", "prediction error on an independent test set", "true conditional error", or "actual test error".

Regardless of the name, it's the average error that the model you fitted on a particular training set $\tau$ would incur when applied to examples drawn from the distribution of $(X, Y)$ pairs. If you lose money each time the fitted model makes an error (or an amount proportional to the error, in the regression case), it's the average amount of money you lose each time you use the fitted model. Arguably, it's the most natural thing to care about for a model you've fitted to a particular training set.

Once that sinks in, the real question is why one should care about expected test error! (HTF also call this "expected prediction error".) After all, it's an average over all sorts of training sets that you're typically never going to get to use. (It appears, by the way, that HTF intend an average over training sets of a particular size in defining expected test error, but they never say this explicitly.)

The reason is that expected test error is a more fundamental characteristic of a learning algorithm, since it averages over the vagaries of whether you got lucky or not with your particular training set.
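To make that averaging concrete, here is a sketch in the toy setup from the question (it reuses the hypothetical `draw` generator and `rng` defined there): the expected test error is just the mean of the conditional test errors over many independent training sets of the same size.

```python
import numpy as np  # assumes draw() and rng from the question's sketch

def conditional_error(n_train=30, n_test=50_000):
    """Conditional test error of a cubic fit to one fresh training set."""
    x_tr, y_tr = draw(n_train)
    coefs = np.polyfit(x_tr, y_tr, deg=3)
    x_te, y_te = draw(n_test)
    return np.mean((y_te - np.polyval(coefs, x_te)) ** 2)

# Expected test error: average the conditional error over many training
# sets of the same size n = 30 (averaging out the luck of your tau).
cond_errs = np.array([conditional_error() for _ in range(500)])
print(f"expected test error (n = 30): {cond_errs.mean():.4f}")
print(f"sd of conditional errors across training sets: {cond_errs.std():.4f}")
```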

As you mention, HTF show that CV estimates the expected test error better than it estimates the conditional test error. This is fortunate if you're comparing machine learning algorithms, but unfortunate if you want to know how well the particular model you fit to a particular training set is going to work.
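A quick simulation in the same toy setup illustrates this (again reusing `draw` and `rng` from the question; the plain k-fold splitter below is a minimal stand-in, not HTF's code): the average CV estimate over many training sets tracks the expected test error, while the correlation between a given set's CV estimate and that same set's conditional error tends to be weak.

```python
import numpy as np  # assumes draw() and rng from the question's sketch

def cv_estimate(x, y, k=5):
    """Plain k-fold CV estimate of the squared-error test error."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    fold_errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)  # all indices not in this fold
        coefs = np.polyfit(x[train], y[train], deg=3)
        fold_errs.append(np.mean((y[fold] - np.polyval(coefs, x[fold])) ** 2))
    return np.mean(fold_errs)

cv_vals, cond_vals = [], []
for _ in range(200):
    x_tr, y_tr = draw(30)
    coefs = np.polyfit(x_tr, y_tr, deg=3)
    x_te, y_te = draw(50_000)
    cond_vals.append(np.mean((y_te - np.polyval(coefs, x_te)) ** 2))
    cv_vals.append(cv_estimate(x_tr, y_tr))

cv_vals, cond_vals = np.array(cv_vals), np.array(cond_vals)
print(f"mean CV estimate:       {cv_vals.mean():.4f}")  # ~ expected test error
print(f"mean conditional error: {cond_vals.mean():.4f}")
print(f"corr(CV, conditional):  {np.corrcoef(cv_vals, cond_vals)[0, 1]:.3f}")
```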