Solved – plotting learning curves with k-fold cross-validation

bias, cross-validation, variance

I want to plot a learning curve to see how the error rate of my model varies as the amount of training data increases.

Getting the training error is simple: I just train and evaluate my model on an increasingly large portion of the dataset.

However, to get the cross-validation error, I don't know the correct way to combine k-fold cross-validation with a gradually increasing training-set size.

What is the correct approach for plotting a cross-validation learning curve using k-fold cross-validation?

I know it would be easier if my test set were fixed, as in the holdout method, instead of using k-fold. In that case I would assign 30% of my dataset to testing, gradually increment the training set from the remaining 70%, and test my model on the 30% holdout set (see the sketch below).
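
For concreteness, here is a minimal sketch of that holdout variant, assuming scikit-learn; the synthetic data and the logistic-regression classifier are placeholders for your own:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data; substitute your own X, y.
X, y = make_classification(n_samples=500, random_state=0)

# Fixed 30% holdout set; the remaining 70% is the pool the training set grows from.
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for frac in np.linspace(0.1, 1.0, 10):
    n = int(frac * len(X_pool))
    model = LogisticRegression(max_iter=1000).fit(X_pool[:n], y_pool[:n])
    train_err = 1 - accuracy_score(y_pool[:n], model.predict(X_pool[:n]))
    test_err = 1 - accuracy_score(y_test, model.predict(X_test))
    print(f"n_train={n:3d}  train_err={train_err:.3f}  holdout_err={test_err:.3f}")
```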

But this method is criticised in many of the textbooks I've read, so I'd rather use k-fold cross-validation instead.

Best Answer

First of all, while I'd usually agree that hold-out does not make efficient use of the available samples and that the typical set-up is prone to the same mistakes as cross validation, repeated set validation / repeated hold-out is a resampling technique that I think is well suited to your learning curve calculation. This way, you can reflect what is going on inside the data set you have, covering the variation due to different splits (but not fully the variation you'd have to expect with a new data set of size $n$). You get the fine-grained control over training-set size of hold-out together with resampling properties close to those of k-fold.
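
A minimal sketch of this repeated hold-out idea, assuming scikit-learn: its `ShuffleSplit` performs exactly repeated random splitting, and `learning_curve` handles the training-set increments. The data and classifier are again placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Placeholder data; substitute your own X, y.
X, y = make_classification(n_samples=500, random_state=0)

# ShuffleSplit draws many random 70/30 splits, i.e. repeated hold-out:
# each training-set size is evaluated across resampled splits rather
# than on one fixed test set.
cv = ShuffleSplit(n_splits=50, test_size=0.3, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=cv,
    train_sizes=np.linspace(0.1, 1.0, 10), scoring="accuracy")

# Error rate = 1 - accuracy; the spread over splits shows the
# split-to-split variance at each training-set size.
for n, tr, va in zip(train_sizes, train_scores, val_scores):
    print(f"n_train={n:3d}  train_err={1 - tr.mean():.3f}  "
          f"val_err={1 - va.mean():.3f} (std {va.std():.3f})")
```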


However, here's a caveat for making an informed decision: if you are dealing with small-sample-size classification, the usual figures of merit (sensitivity, specificity, overall accuracy, etc.) are subject to very high testing variance. This testing variance is governed by the number of truly independent cases in the denominator of the calculation, and it can easily be so large that you cannot sensibly use such measured learning curves (keep in mind that "use" typically means extrapolation).
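
As a back-of-the-envelope illustration of that variance, assuming the binomial approximation for a proportion measured on $n_\text{test}$ independent test cases (the value of $p$ here is purely illustrative):

```python
from math import sqrt

# Binomial approximation: with n_test truly independent test cases and
# true accuracy p, the observed accuracy has standard error
# sqrt(p * (1 - p) / n_test).
p = 0.85  # assumed true accuracy, purely illustrative
for n_test in (10, 30, 100, 1000):
    se = sqrt(p * (1 - p) / n_test)
    print(f"n_test={n_test:5d}  standard_error={se:.3f}")

# With 10 test cases the standard error is about 0.11, i.e. the measured
# accuracy is roughly 0.85 +/- 0.22 at ~95% confidence: far too noisy to
# read a trend from, let alone extrapolate.
```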

See our paper for details: Beleites, C.; Neugebauer, U.; Bocklitz, T.; Krafft, C.; Popp, J.: Sample size planning for classification models. Anal. Chim. Acta, 2013, 760, 25–33. DOI: 10.1016/j.aca.2012.11.007. Accepted manuscript on arXiv: 1211.1323.