Solved – how to plot a learning curve for k-fold cross-validation

classification, cross-validation

In his Coursera video lecture, Prof. Andrew Ng discusses some basic good practices in machine learning. At around the 11-minute mark of this lecture, https://www.youtube.com/watch?v=ISBGFY-gBug, he shows the learning curve, which is a plot of cross-validation error and training error versus the size of the training set. I am using the k-fold cross-validation method for hyperparameter tuning and model selection.

In this scenario,

  • Consider the variable Xdata to be the entire feature set. It is split into a training set, DataTrain, which is used in the k-fold setup and is further split into a training subset and a validation subset.
  • So, within the k-fold setup, DataTrain yields trainData and testData for each fold.
  • There is also an independent test set, denoted by the variable DataTest.

    When using the k-fold cross-validation method, to plot the learning curve, would the training error be the misclassification error on DataTrain, and would the cross-validation error be the misclassification error on the validation subset, testData?

Best Answer

When using the k-fold cross-validation method, to plot the learning curve, would the training error be the misclassification error on DataTrain, and would the cross-validation error be the misclassification error on the validation subset, testData?

No.

  • The training error would be the average, over the k folds, of the error on trainData.

  • The cross-validation (test) error would be the average, over the k folds, of the error on testData.

Remember that for each fold, the datasets trainData and testData are different.
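
The averaging described above can be sketched as follows. This is a minimal illustration, not the asker's actual pipeline: the dataset, the `StratifiedKFold` splitter, the `LogisticRegression` model, and the list of training-subset sizes are all assumptions standing in for DataTrain and the chosen classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for DataTrain: a synthetic binary classification set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

train_sizes = [50, 100, 150, 200]  # growing sizes of the training subset
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

train_errors, test_errors = [], []
for size in train_sizes:
    fold_train_err, fold_test_err = [], []
    # For each fold, trainData (train_idx) and testData (test_idx) differ.
    for train_idx, test_idx in kf.split(X, y):
        # Train on only the first `size` examples of this fold's trainData.
        sub = train_idx[:size]
        clf = LogisticRegression(max_iter=1000).fit(X[sub], y[sub])
        # Misclassification error = 1 - accuracy.
        fold_train_err.append(1 - clf.score(X[sub], y[sub]))
        fold_test_err.append(1 - clf.score(X[test_idx], y[test_idx]))
    # Average each error over the k folds, as described above.
    train_errors.append(np.mean(fold_train_err))
    test_errors.append(np.mean(fold_test_err))
```

Plotting `train_errors` and `test_errors` against `train_sizes` then gives the learning curve from the lecture.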


Source:

A cross-validation generator splits the whole dataset k times in training and test data. Subsets of the training set with varying sizes will be used to train the estimator and a score for each training subset size and the test set will be computed. Afterwards, the scores will be averaged over all k runs for each training subset size.
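
In scikit-learn this procedure is packaged as `sklearn.model_selection.learning_curve`, which does the k splits and the subset-size loop internally. A short sketch, assuming the same synthetic data and logistic-regression model as above (both are illustrative choices, not part of the original question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Hypothetical stand-in for the asker's data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# learning_curve trains on growing subsets of each fold's training split
# and scores on that fold's test split.
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 4), cv=5, scoring="accuracy")

# Convert accuracy to misclassification error and average over the k runs.
train_err = 1 - train_scores.mean(axis=1)
test_err = 1 - test_scores.mean(axis=1)
```

Plotting `train_err` and `test_err` against `sizes` reproduces the learning curve without writing the fold loop by hand.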