Bias-variance tradeoff associated with cross-validation methods

bias-variance-tradeoff, cross-validation

I was reading about the bias-variance tradeoff associated with cross-validation methods in James et al., An Introduction to Statistical Learning (pp. 183-184).

When we perform LOOCV, we are in effect averaging the outputs of n fitted models, each of which is trained on an almost identical set of observations; therefore, these outputs are highly (positively) correlated with each other.

What exactly is meant by "outputs of n fitted models"?
For example, if we are using linear regression, do these outputs refer to the model parameters?

Since the mean of many highly correlated quantities has higher variance than does the mean of many quantities that are not as highly correlated, the test error estimate resulting from LOOCV tends to have higher variance than does the test error estimate resulting from k-fold CV.
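For context, the variance claim itself is the standard identity for the mean of correlated quantities: writing $e_1, \dots, e_n$ for the quantities being averaged, each with variance $\sigma^2$ and pairwise correlation $\rho$,

$$\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} e_i\right) = \frac{\sigma^2}{n} + \frac{n-1}{n}\,\rho\,\sigma^2,$$

so the benefit of averaging shrinks as $\rho$ grows.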

What exactly are the quantities being referred to in the above excerpt?
For example, if I have 5 features in a data set of 100 observations and I want to use linear regression and LOOCV, I will get 100 different models, where each model has 6 parameter estimates. Since the 100 models only differ by one observation, they are almost the same. When we talk about correlated quantities, are we talking about the parameters in these 100 models? And if so, how does that lead to the test error estimate of LOOCV having a higher variance?
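To make the setup concrete, here is a minimal sketch of what I have in mind (the simulated data, scikit-learn usage, and variable names are my own, not from the book): each of the 100 fitted models yields both a parameter vector and a prediction for its single held-out observation, and the LOOCV estimate is the average of the 100 per-fold squared errors.

```python
# Minimal LOOCV sketch: 100 observations, 5 features, hence 100 linear
# models, each fitted to the other 99 observations. (Simulated data;
# all names here are illustrative.)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # 100 observations, 5 features
y = X @ rng.normal(size=5) + rng.normal(size=100)

fold_errors = []   # one squared error per fitted model (its output on the held-out case)
coefs = []         # one 6-dimensional parameter estimate (intercept + 5 slopes) per model
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_errors.append((y[test_idx][0] - pred[0]) ** 2)
    coefs.append(np.r_[model.intercept_, model.coef_])

cv_estimate = np.mean(fold_errors)             # the LOOCV test-error estimate
print(cv_estimate)
```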

Best Answer

In addition to @qwr's answer:

I think it is important to realize that lots of arguments about LOO (or other types of CV) are ambiguous because of unstated assumptions and approximations. At first glance, that may be the explanation here as well:

  • LOO can be seen as an approximation to the one model obtained from all n cases (which is typically a good approximation).
    I'd call this the "applied use of CV".
  • But LOO is also treated as an approximation to a model obtained on other cases from the same population - and that is a much worse approximation. This is the situation where the high correlation of the LOO models (which is higher than the correlation between n models fitted to n new data sets of size n-1 would be) hurts, i.e. it causes one to underestimate variance. I'd call this "the algorithm developer's use of CV".
  • As k-fold CV leaves out more cases at a time than LOO, approximation 1 becomes worse, while approximation 2 becomes less bad (I'm not saying better, because for typical $k$ it won't be that much better). A small simulation sketch of this comparison follows this list.
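For illustration, a rough simulation of the "algorithm developer's" comparison might look like the following: the spread of the LOO and 10-fold estimates across fresh data sets drawn from the same population. (Simulated linear-regression data and scikit-learn are assumed; note that the spread measured this way also contains the sample-to-sample contribution mentioned below, and whether LOO actually comes out with higher variance depends on the learner and the data-generating process.)

```python
# Compare the spread of LOO vs. 10-fold CV estimates over many fresh
# data sets from one population. (Illustrative setup, not from the book.)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut, KFold

beta = np.random.default_rng(1).normal(size=5)   # fixed population coefficients

def cv_estimate(splitter, seed):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(100, 5))                # a fresh sample of 100 cases
    y = X @ beta + rng.normal(size=100)
    scores = cross_val_score(LinearRegression(), X, y,
                             cv=splitter, scoring="neg_mean_squared_error")
    return -scores.mean()                        # the CV test-error estimate

loo_est = [cv_estimate(LeaveOneOut(), s) for s in range(200)]
kf_est = [cv_estimate(KFold(10, shuffle=True, random_state=0), s) for s in range(200)]
print(np.var(loo_est), np.var(kf_est))           # variance of each estimator
```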

Keep in mind that there are further contributions to the total variance of the CV estimate:

  • the sample-to-sample variance, which is not affected by LOO vs. k-fold
  • model instability, which may be slightly worse for k-fold CV than for LOO. That would counteract the effect described above.

The IMHO much worse problem with LOO is the 1:1 correspondence between surrogate model and tested case: it precludes any attempt to separate the model-instability-type variance from the sample-to-sample variance - and as an "applied" person, that distinction is very important to me.
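To sketch what I mean (again with illustrative simulated data and scikit-learn): in iterated k-fold CV, each case is predicted by several different surrogate models across iterations, so the spread of the predictions for the same case isolates the model-instability contribution; under LOO, every iteration produces the identical splits and predictions, so this decomposition is impossible.

```python
# Iterated 10-fold CV: each case gets one prediction per iteration, from a
# different surrogate model each time. The per-case spread across iterations
# reflects model instability. (Illustrative setup; under LOO all iterations
# would be identical, so this spread could not be observed.)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(size=100)

preds = np.empty((20, 100))                    # 20 iterations x 100 cases
for it in range(20):
    for tr, te in KFold(10, shuffle=True, random_state=it).split(X):
        model = LinearRegression().fit(X[tr], y[tr])
        preds[it, te] = model.predict(X[te])

instability = preds.var(axis=0).mean()         # average per-case spread across surrogate models
print(instability)
```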
