Solved – How exactly to partition the training set for k-fold cross-validation on a multi-class dataset

cross-validation, image processing, machine learning, svm, validation

Cross-validation is one of the most important evaluation tools because it gives us an honest assessment of a system's true accuracy: the estimate is not tied to a single lucky (or unlucky) train/test split, so it paints a much more reliable picture of how the system will actually perform.

Suppose, for example, that we have a dataset containing a single class (let's consider a face dataset):

In this case, we divide our dataset into k folds (portions). A common value of k is 10, so we would divide the dataset into 10 parts. We then run k rounds of cross-validation: in each round, one of the folds is used for validation and the remaining k-1 folds for training. After training the classifier, we measure its accuracy on the validation fold, and we average the accuracy over the k rounds to obtain the final cross-validation accuracy.

Prepare the dataset.
Divide it into 10 folds.
for i = 1:10                        % ten rounds
    use fold(i) for validation
    train on the remaining 9 folds
    record the accuracy of round i
end
Final accuracy = Average(Round1, Round2, ..., Round10).
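
For concreteness, the loop above could look roughly like the following MATLAB sketch (a sketch only: the feature matrix X, a label vector y with numeric or categorical labels such as face / non-face, and fitcsvm from the Statistics and Machine Learning Toolbox are my own illustrative assumptions, not part of the question):

    k = 10;
    c = cvpartition(numel(y), 'KFold', k);   % randomly split the samples into 10 folds
    acc = zeros(k, 1);
    for i = 1:k
        trainIdx = training(c, i);           % logical index of the 9 training folds
        testIdx  = test(c, i);               % logical index of fold i (validation)
        mdl  = fitcsvm(X(trainIdx, :), y(trainIdx));   % binary SVM, as an example
        pred = predict(mdl, X(testIdx, :));
        acc(i) = mean(pred == y(testIdx));   % accuracy on the held-out fold
    end
    finalAccuracy = mean(acc);               % average over the 10 rounds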

But what if we have a dataset that contains multiple classes (let's consider 3 classes: the faces, airplanes, and strawberry categories)?

I am not sure whether my idea is correct: each of the three categories is split into 10 folds. Do we then measure the final accuracy as above? For example:

For the first round: take the first fold from each of the three categories, use these folds as the validation set, and use all the remaining folds (from all categories) for training.

For the second round: take the second fold from each of the three categories, use these folds as the validation set, and use all the remaining folds (from all categories) for training.

And so on, until round 10.
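
Written out, the scheme I have in mind would look roughly like the sketch below (only a sketch: X, y with numeric class labels, and fitcecoc as a multi-class SVM are my own assumptions):

    k = 10;
    classes = unique(y);
    foldId = zeros(size(y));                       % fold assignment of every sample
    for ci = 1:numel(classes)
        idx = find(y == classes(ci));              % samples of this category
        idx = idx(randperm(numel(idx)));           % shuffle within the category
        foldId(idx) = mod(0:numel(idx)-1, k) + 1;  % spread them over the 10 folds
    end
    acc = zeros(k, 1);
    for i = 1:k
        testIdx  = (foldId == i);                  % fold i of every category -> validation
        trainIdx = ~testIdx;                       % everything else -> training
        mdl  = fitcecoc(X(trainIdx, :), y(trainIdx));   % multi-class SVM
        pred = predict(mdl, X(testIdx, :));
        acc(i) = mean(pred == y(testIdx));
    end
    finalAccuracy = mean(acc);

(As far as I understand, cvpartition(y, 'KFold', 10) produces this kind of stratified partition in a single call.)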

Is that correct? Is my reasoning right? Please, I need your help and an explanation.

Any help will be very much appreciated.

Best Answer

First you need to decide whether you need model/parameter selection, or just an assessment of a fixed model. Once your model is fixed, the bootstrap arguably makes more sense for determining how your modeling procedure performs.

If you are implementing cross-validation on a multi-class dataset, just randomly partition the data without considering the labels. It can happen that a class present in the test fold was never seen during training; that simply counts toward the validation error. It is usually recommended to repeat a 10-fold cross-validation 50-100 times for stability.
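
A minimal sketch of that repeated, label-blind 10-fold procedure (the names X and y and the choice of fitcecoc are assumptions for illustration only):

    k = 10;
    nRepeats = 50;                               % 50-100 repeats for a stable estimate
    repAcc = zeros(nRepeats, 1);
    for r = 1:nRepeats
        c = cvpartition(numel(y), 'KFold', k);   % fresh random partition, labels ignored
        foldAcc = zeros(k, 1);
        for i = 1:k
            trainIdx = training(c, i);
            testIdx  = test(c, i);
            mdl  = fitcecoc(X(trainIdx, :), y(trainIdx));
            pred = predict(mdl, X(testIdx, :));
            foldAcc(i) = mean(pred == y(testIdx));
        end
        repAcc(r) = mean(foldAcc);               % CV accuracy of this repeat
    end
    estimate  = mean(repAcc);                    % overall accuracy estimate
    stability = std(repAcc);                     % spread across repeats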

You may try to avoid the class-imbalance issue (and thus indirectly reduce the odds of the excluded-class event mentioned above), but if your data really suffers from this problem, there are several re-sampling strategies in my previous answer in this post.
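
Purely as a generic illustration of one such re-sampling strategy (random oversampling of the smaller classes; the names XTrain and yTrain are hypothetical, and this must be applied to the training fold only, never to the validation fold):

    % Duplicate training samples of the smaller classes until every class
    % matches the largest one (numeric column-vector labels assumed).
    classes = unique(yTrain);
    counts  = arrayfun(@(cl) sum(yTrain == cl), classes);
    target  = max(counts);
    Xbal = XTrain;  ybal = yTrain;
    for ci = 1:numel(classes)
        idx   = find(yTrain == classes(ci));
        extra = idx(randi(numel(idx), target - numel(idx), 1));   % sample with replacement
        Xbal  = [Xbal; XTrain(extra, :)];
        ybal  = [ybal; yTrain(extra)];
    end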