Solved – How the value of k in k-fold cross-validation affects the resulting accuracy

cross-validation, machine-learning, svm

I've been doing some machine learning, and have been using k-fold cross-validation to assess the generalisation performance of the algorithm. I tried k-fold cross-validation with k = 5 and k = 200 and got very different results for Support Vector Machine classification.

k    SVM accuracy
-----------------
5    75%
200  94%

This seems like a huge difference in accuracy to be caused simply by changing the number of splits used for the k-fold cross-validation. Is there any reason for this? I can't find any references to studies investigating the effect of different k values. Obviously, whichever k value I use in my report gives a completely different impression of the quality of my classifier!
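For context, here is a minimal sketch of this kind of comparison using scikit-learn; the dataset and SVM settings are illustrative assumptions, not the asker's actual setup:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative dataset (569 samples); any classification dataset works here.
X, y = load_breast_cancer(return_X_y=True)

# Scaling matters for SVMs, so bundle it with the classifier in a pipeline.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

for k in (5, 200):
    # cross_val_score with an integer cv runs k-fold CV
    # (stratified for classifiers) and returns one accuracy per fold.
    scores = cross_val_score(clf, X, y, cv=k)
    print(f"k={k:3d}  mean accuracy={scores.mean():.3f}  std={scores.std():.3f}")
```

Note that with large k each fold is tiny, so the per-fold scores become much noisier even though the mean may go up.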

Best Answer

Not much of a "proof", but when k is small you are removing a much larger chunk of your data, so your model has a much smaller amount of data to "learn from". For k = 5 you are removing 20% of the data each time, whereas for k = 200 you are only removing 0.5%. Your model has a much better chance of picking up all the relevant "structure" in the training part when k is large. When k is small, there is a larger chance that the "left out" part will contain structure which is absent from the "left in" part - a bit like an "un-representative" sub-sample.
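To make that arithmetic concrete, here is a small sketch (the dataset size n is an illustrative assumption) of how the training-fold size grows with k:

```python
# With n samples and k folds, each model trains on n*(k-1)/k samples
# and holds out n/k for evaluation.
n = 1000  # illustrative dataset size, not from the question

for k in (5, 200):
    held_out = n / k
    train = n - held_out
    print(f"k={k:3d}: train on {train:.0f} samples ({100 * (k - 1) / k:.1f}%), "
          f"hold out {held_out:.0f} ({100 / k:.1f}%)")
```

For k = 5 each model sees only 80% of the data, while for k = 200 it sees 99.5%, which is why the larger k tends to produce a less pessimistic accuracy estimate.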