Solved – High Standard Deviation for Leave-One-Out Cross-Validation

cross-validation, machine learning, mean, standard deviation, statistical significance

I am using leave-one-out cross-validation to evaluate my model. If the prediction on the held-out sample is correct the output is 1, otherwise 0, so at the end I have an array of N values containing 0s and 1s. I then average these values to get the mean prediction accuracy and also calculate the standard deviation. I am getting a mean of 0.6, but the standard deviation is 0.5, which seems large. Mean + standard deviation is even more than the range of the values; is that normal or is something wrong?
I have read that leave-one-out CV tends to have high variance because of the high correlation between the fitted models.
My second question: is there a significance test I can run on the cross-validation results to evaluate them?
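
For concreteness, here is a minimal sketch of the setup described in the question; the data, the classifier, and all names are placeholders (it assumes NumPy and scikit-learn):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

# Placeholder data: X (features) and y (labels) stand in for the real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.integers(0, 2, size=50)

# One 0/1 outcome per left-out sample: 1 if the prediction was correct, else 0.
outcomes = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    outcomes.append(int(model.predict(X[test_idx])[0] == y[test_idx][0]))

outcomes = np.array(outcomes)
print("mean accuracy:", outcomes.mean())
print("standard deviation:", outcomes.std())
```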

Best Answer

The big standard deviation is completely normal. In fact, it is completely determined by the mean prediction accuracy: your values are either 0 or 1 (with 60% being 1), so the standard deviation is $\sqrt{0.4\cdot(0-0.6)^2 + 0.6\cdot(1-0.6)^2} \approx 0.49$. That mean $\pm$ standard deviation exceeds the $[0, 1]$ range is not a problem either; that interval is only a meaningful summary for roughly Gaussian data, and a 0/1 outcome is far from Gaussian.
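
A quick numerical check, using a hypothetical array of 600 correct and 400 incorrect predictions (assuming NumPy):

```python
import numpy as np

# 600 correct and 400 incorrect predictions, i.e. a mean accuracy of 0.6.
outcomes = np.array([1] * 600 + [0] * 400)

print(outcomes.mean())                   # 0.6
print(outcomes.std())                    # ~0.49, i.e. sqrt(0.6 * 0.4)
print(outcomes.mean() + outcomes.std())  # ~1.09, above 1, which is fine
```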

For why the variance of the prediction error rate is higher for leave-one-out CV compared to 10-fold CV, see this older answer.

Regarding your second question, I would use a permutation test: completely shuffle the mapping between features and labels of your training data, train a new model on it, and calculate its mean prediction accuracy -- this estimates the accuracy you get by chance[1]. Repeat this procedure many times to get a distribution of the chance prediction accuracy. Now compare your actual prediction accuracy (with unshuffled labels) to that distribution -- your $p$-value is the fraction of chance accuracies that are at least as good as your actual prediction accuracy.
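
A minimal sketch of that procedure; the classifier, the data, and the number of permutations are illustrative placeholders (it assumes NumPy and scikit-learn, and note that every permutation re-runs the full leave-one-out loop, so this can get slow):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

def loo_accuracy(X, y):
    """Mean leave-one-out prediction accuracy (each fold scores 0 or 1)."""
    scores = cross_val_score(LogisticRegression(), X, y,
                             cv=LeaveOneOut(), scoring="accuracy")
    return scores.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))       # placeholder features
y = rng.integers(0, 2, size=50)    # placeholder labels

actual = loo_accuracy(X, y)

# Chance distribution: shuffle the labels, redo the whole CV, repeat.
n_permutations = 1000
chance = np.array([loo_accuracy(X, rng.permutation(y))
                   for _ in range(n_permutations)])

# Fraction of chance accuracies at least as good as the actual one.
p_value = np.mean(chance >= actual)
print("actual accuracy:", actual, "p-value:", p_value)
```

scikit-learn's `permutation_test_score` implements essentially the same idea if you prefer not to write the loop yourself.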

If you have few data points, you should do the permutation test with all possible permutations. Otherwise, you need enough repetitions to make sure that (the complete confidence interval of) the $p$ value is below your significance level. I don't have good rules of thumb for the general case; the relevant Wikipedia article links to this paper.
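
One rough way to check whether the number of repetitions is enough is to put a confidence interval around the estimated $p$-value; the sketch below uses a simple normal approximation and made-up counts:

```python
import numpy as np

n_permutations = 1000   # hypothetical number of label shufflings
n_better = 12           # hypothetical count of chance accuracies >= actual

p_hat = n_better / n_permutations
# Normal-approximation 95% confidence interval for the permutation p-value.
half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n_permutations)
ci = (max(0.0, p_hat - half_width), min(1.0, p_hat + half_width))

alpha = 0.05
print("p-value estimate:", p_hat, "95% CI:", ci)
print("whole CI below alpha:", ci[1] < alpha)
```

For very small counts the normal approximation is poor; an exact (Clopper-Pearson) or Wilson interval would be safer.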


[1] For a balanced two-class problem, this chance level should typically be about 0.5; it decreases with the number of classes and increases if some classes are more frequent in your training data than others.