Solved – High variance of leave-one-out cross-validation

Tags: bias, cross-validation, variance

I read over and over that leave-one-out cross-validation has high variance due to the large overlap of the training folds. However, I do not understand why that is: shouldn't the performance of the cross-validation be very stable (low variance) precisely because the training sets are almost identical?
Or do I have a wrong understanding of the concept of "variance" altogether?

I also do not fully understand how LOO can be unbiased yet have high variance. If the LOO estimate equals the true value in expectation, how can it then have high variance?

Note: I know that there is a similar question here:
Why is leave-one-out cross-validation (LOOCV) variance about the mean estimate for error high? However, the person who answered says later in the comments that, despite the upvotes, he has realized his answer is wrong.

Best Answer

This question is probably going to end up being closed as a duplicate of Variance and bias in cross-validation: why does leave-one-out CV have higher variance?, but before that happens I think I will turn my comments into an answer.

I also do not fully understand how LOO can be unbiased, but have a high variance?

Consider a simple example. Let the true value of a parameter be $0.5$. An estimator that yields $0.49, 0.51, 0.49, 0.51, \ldots$ is unbiased and has relatively low variance, whereas an estimator that yields $0.1, 0.9, 0.1, 0.9, \ldots$ is also unbiased but has much higher variance.
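To make the distinction concrete, here is a minimal numerical sketch of these two hypothetical estimators (the repeating sequences and the Python check are my own illustration, built only on the values above):

```python
import numpy as np

true_value = 0.5

# Two hypothetical unbiased estimators of the same parameter:
est_a = np.array([0.49, 0.51] * 50)  # always close to 0.5
est_b = np.array([0.1, 0.9] * 50)    # swings far around 0.5

for name, est in [("A", est_a), ("B", est_b)]:
    bias = est.mean() - true_value   # distance of the average estimate from the truth
    variance = est.var()             # spread of the estimates around their own mean
    print(f"estimator {name}: bias = {bias:+.4f}, variance = {variance:.4f}")

# estimator A: bias = +0.0000, variance = 0.0001
# estimator B: bias = +0.0000, variance = 0.1600
```

Both estimators average to the true value (zero bias), but the second one scatters far more widely around it.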

Shouldn't the performance of the cross-validation be very stable (low variance) exactly because the training sets are almost identical?

You need to think about the variance across different realizations of the whole dataset. For a given dataset, leave-one-out cross-validation will indeed produce very similar models for each split because the training sets overlap so heavily (as you correctly noticed), but all of these models can together be far away from the true model; across datasets, they will be far away in different directions, hence the high variance.
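If you want to see what "variance across realizations of the whole dataset" means in code, here is a minimal simulation sketch. Everything in it (the linear data-generating process, the sample size, and the choice of 5-fold CV as a comparison) is my own illustrative assumption, not from the thread; note also that whether LOO actually comes out with higher variance depends on the model and the data-generating process, so in some settings the difference is small or even reversed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

def cv_estimates(cv, n=30, n_datasets=200, seed=0):
    """One CV estimate of the mean squared error per simulated dataset."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_datasets):
        # Fresh realization of the whole dataset: y = 2x + Gaussian noise.
        X = rng.normal(size=(n, 1))
        y = 2 * X[:, 0] + rng.normal(size=n)
        scores = cross_val_score(LinearRegression(), X, y,
                                 scoring="neg_mean_squared_error", cv=cv)
        estimates.append(-scores.mean())
    return np.array(estimates)

# Same simulated datasets (same seed) for both CV schemes.
loo = cv_estimates(LeaveOneOut())
k5 = cv_estimates(KFold(n_splits=5, shuffle=True, random_state=0))

print(f"LOO:    mean estimate = {loo.mean():.3f}, variance across datasets = {loo.var():.4f}")
print(f"5-fold: mean estimate = {k5.mean():.3f}, variance across datasets = {k5.var():.4f}")
```

The quantity to look at is the variance of the CV estimate across the simulated datasets, not the agreement of the per-split models within any single dataset.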

At least that's how I understand it. Please see the linked threads for more discussion, and the referenced papers for even more.