Solved – Why does k-fold cross-validation generate an MSE estimator that has higher bias but lower variance than leave-one-out cross-validation

bias-variance tradeoff, cross-validation

Looks like the rationale behind the accepted answer of this post is incorrect.

Under leave-one-out cross-validation (LOOCV), the variance of its MSE estimator is
$$\operatorname{var}\!\left[\frac{\sum_i x_i}{n}\right] = \frac{\operatorname{var}\!\left[\sum_i x_i\right]}{n^2},$$ where $x_i$ is the estimate of the MSE from one particular iteration.

I agree that LOOCV has a larger numerator (because of the covariance terms), but the denominator is larger as well, since there are essentially $n$ estimates rather than the $k$ estimates of the $k$-fold case.
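Spelling out the step behind that claim, the numerator expands via the standard variance-of-a-sum identity (nothing specific to cross-validation):

$$\operatorname{var}\!\left[\frac{\sum_i x_i}{n}\right] = \frac{1}{n^2}\left(\sum_i \operatorname{var}[x_i] + \sum_{i \neq j} \operatorname{Cov}(x_i, x_j)\right),$$

so the covariance terms sit in the numerator while the $n^2$ sits in the denominator.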

Given this, why does LOOCV still have higher variance in estimating MSE and why does it have lower bias?

(This runs against intuition, because increasing the sample size should decrease the variance and leave the bias unchanged for $\hat\theta$ and $\hat{y}$.)

Best Answer

There has been much debate, confusion and contradiction on this topic, both on stats.stackexchange and in the scientific literature.

A useful reference is the 2004 paper by Bengio & Grandvalet, "No Unbiased Estimator of the Variance of K-Fold Cross-Validation", which shows that the variance of the cross-validation estimator is a linear combination of three moments:

$$\operatorname{var} = \frac{1}{n^2} \sum_{i,j} \operatorname{Cov}(e_i, e_j) = \frac{1}{n}\sigma^2 + \frac{m-1}{n}\omega + \frac{n-m}{n}\gamma$$

where each term is a particular component of the $n \times n$ covariance matrix $\Sigma$ of the cross-validation errors $\mathbf{e} = (e_1, \dots, e_n)^T$: $\sigma^2$ is the variance of a single error $e_i$, $\omega$ is the covariance between two errors in the same test block, and $\gamma$ is the covariance between errors in different test blocks, with $m = n/K$ the test-block size.

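To make the decomposition concrete, here is a minimal Monte-Carlo sketch (using a hypothetical linear-Gaussian data generator plus scikit-learn's `LinearRegression` and `KFold`, none of which come from the post or the paper) that estimates $\sigma^2$, $\omega$ and $\gamma$ from repeated $K$-fold runs and plugs them into the formula above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
n, K, n_sims = 60, 6, 2000
m = n // K                                            # test-block size
# Fixed, unshuffled partition: block b = observations b*m, ..., (b+1)*m - 1
folds = list(KFold(n_splits=K).split(np.arange(n)))

def cv_error_vector():
    """Draw a fresh dataset and return the n out-of-fold squared errors."""
    X = rng.normal(size=(n, 2))
    y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)
    e = np.empty(n)
    for train_idx, test_idx in folds:
        fit = LinearRegression().fit(X[train_idx], y[train_idx])
        e[test_idx] = (y[test_idx] - fit.predict(X[test_idx])) ** 2
    return e

E = np.array([cv_error_vector() for _ in range(n_sims)])  # shape (n_sims, n)
C = np.cov(E, rowvar=False)                               # empirical n x n covariance matrix Sigma

block = np.repeat(np.arange(K), m)                        # block label of each observation
same_block = block[:, None] == block[None, :]
diag = np.eye(n, dtype=bool)

sigma2 = C[diag].mean()                 # variance of a single error
omega = C[same_block & ~diag].mean()    # covariance within the same test block
gamma = C[~same_block].mean()           # covariance across different test blocks

print("sigma^2, omega, gamma:", sigma2, omega, gamma)
print("formula:     ", sigma2 / n + (m - 1) / n * omega + (n - m) / n * gamma)
print("direct check:", E.mean(axis=1).var(ddof=1))        # variance of the CV estimate itself
```

The last two printed lines agree because both are computed from the same empirical covariance matrix; the sketch is only meant to show how $\sigma^2$, $\omega$ and $\gamma$ arise from the block structure of $\Sigma$.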

As @Amoeba points out in a comment above, this variance is not a straightforward function of $K$. Each data point $x_i$ contributes an error term $e_i$, and these errors are averaged into the MSE. Varying $K$ therefore has no direct, algebraically simple effect on the variance of the CV estimator.

$k$-fold CV with any value of $k$ produces one error for each of the $n$ observations, so the MSE estimate always has denominator $n$. This denominator does not change between LOOCV and, e.g., 10-fold CV. This is the main source of your confusion.
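A minimal sketch (with a hypothetical simulated dataset and scikit-learn's `LinearRegression`, `KFold` and `LeaveOneOut`, none of which appear in the post) illustrating that both schemes produce exactly $n$ held-out errors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

def squared_cv_errors(splitter):
    """Collect the out-of-fold squared error for each of the n observations."""
    errors = np.empty(n)
    for train_idx, test_idx in splitter.split(X):
        fit = LinearRegression().fit(X[train_idx], y[train_idx])
        errors[test_idx] = (y[test_idx] - fit.predict(X[test_idx])) ** 2
    return errors

for name, splitter in [("LOOCV", LeaveOneOut()),
                       ("10-fold", KFold(n_splits=10, shuffle=True, random_state=0))]:
    e = squared_cv_errors(splitter)
    # Both schemes yield exactly n errors, so the MSE estimate averages n terms either way.
    print(f"{name}: {e.size} errors, CV MSE = {e.mean():.3f}")
```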

Now, there is a lot more subtlety in this variance equation than meets the eye. In particular, the terms $\omega$ and $\gamma$ are influenced by the correlations between training sets and test sets and by the instability of the model. Both effects depend on the value of $K$, which explains why different datasets and models lead to different behaviours.
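To see how these effects can play out, here is a minimal Monte-Carlo sketch (hypothetical data-generating process and model, chosen only for illustration) that estimates the variance of the CV MSE estimator for several values of $K$; whatever ordering it produces depends on the chosen generator and model, which is exactly the point:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n, n_sims = 40, 500

def make_data():
    X = rng.normal(size=(n, 5))
    y = X[:, 0] + rng.normal(size=n)
    return X, y

def cv_mse(X, y, k):
    """Average of the n out-of-fold squared errors under k-fold CV."""
    e = np.empty(n)
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        fit = LinearRegression().fit(X[train_idx], y[train_idx])
        e[test_idx] = (y[test_idx] - fit.predict(X[test_idx])) ** 2
    return e.mean()

for k in (2, 5, 10, n):                                   # k = n is LOOCV
    estimates = np.array([cv_mse(*make_data(), k) for _ in range(n_sims)])
    print(f"K={k:>2}: mean estimate = {estimates.mean():.3f}, "
          f"variance = {estimates.var(ddof=1):.4f}")
```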

You will need to read through the extensive (and technical) literature to really grasp the subtleties and special cases.
