Solved – How exactly to partition the training set for k-fold cross-validation on a multi-class dataset

cross-validation, image processing, machine learning, svm, validation

Cross-validation is one of the most important evaluation tools because it gives us an honest assessment of a system's true accuracy: the estimate is not tied to a single lucky (or unlucky) train/test split, so it paints a much more reliable picture of how the system will actually perform.

Suppose, for example, that we have a dataset containing a single class (let's consider a face dataset):

In this case, we divide our dataset into k folds (portions). A common value of k is 10, so we would divide the dataset into 10 parts. We then run k rounds of cross-validation: in each round, one of the folds is used for validation and the remaining k-1 folds for training. After training the classifier, we measure its accuracy on the validation fold, and we average the accuracy over the k rounds to obtain the final cross-validation accuracy.

Prepare the dataset.
Divide it into 10 folds.
for i = 1:10                        % ten rounds
    use fold(i) for validation
    train on the remaining 9 folds
    record the accuracy of round i
end
Final accuracy = Average(Round1, Round2, ..., Round10).
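
For concreteness, the loop above could look roughly like the following MATLAB sketch (a sketch only: the feature matrix X, a label vector y with numeric or categorical labels such as face / non-face, and fitcsvm from the Statistics and Machine Learning Toolbox are my own illustrative assumptions, not part of the question):

    k = 10;
    c = cvpartition(numel(y), 'KFold', k);   % randomly split the samples into 10 folds
    acc = zeros(k, 1);
    for i = 1:k
        trainIdx = training(c, i);           % logical index of the 9 training folds
        testIdx  = test(c, i);               % logical index of fold i (validation)
        mdl  = fitcsvm(X(trainIdx, :), y(trainIdx));   % binary SVM, as an example
        pred = predict(mdl, X(testIdx, :));
        acc(i) = mean(pred == y(testIdx));   % accuracy on the held-out fold
    end
    finalAccuracy = mean(acc);               % average over the 10 rounds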

But what if we have a dataset that contains multiple classes (let's consider 3 classes: the faces, airplanes, and strawberry categories)?

I am not sure whether my idea is correct: each of the three categories is split into 10 folds. Do we then measure the final accuracy as above? For example:

For the first round: take the first fold from each of the three categories, use these folds as the validation set, and use all the remaining folds (from all categories) for training.

For the second round: take the second fold from each of the three categories, use these folds as the validation set, and use all the remaining folds (from all categories) for training.

And so on, until round 10.
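
Written out, the scheme I have in mind would look roughly like the sketch below (only a sketch: X, y with numeric class labels, and fitcecoc as a multi-class SVM are my own assumptions):

    k = 10;
    classes = unique(y);
    foldId = zeros(size(y));                       % fold assignment of every sample
    for ci = 1:numel(classes)
        idx = find(y == classes(ci));              % samples of this category
        idx = idx(randperm(numel(idx)));           % shuffle within the category
        foldId(idx) = mod(0:numel(idx)-1, k) + 1;  % spread them over the 10 folds
    end
    acc = zeros(k, 1);
    for i = 1:k
        testIdx  = (foldId == i);                  % fold i of every category -> validation
        trainIdx = ~testIdx;                       % everything else -> training
        mdl  = fitcecoc(X(trainIdx, :), y(trainIdx));   % multi-class SVM
        pred = predict(mdl, X(testIdx, :));
        acc(i) = mean(pred == y(testIdx));
    end
    finalAccuracy = mean(acc);

(As far as I understand, cvpartition(y, 'KFold', 10) produces this kind of stratified partition in a single call.)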

Is that correct? Is my reasoning right? Please, I need your help and an explanation.

Any help will be very much appreciated.

Best Answer

First you need to decide whether you need model/parameter selection, or just an assessment of a fixed model. Once your model is fixed, the bootstrap arguably makes more sense for determining how your modeling procedure performs.

If you are implementing cross-validation on a multi-class dataset, just randomly partition the data without considering the labels. It can happen that a class present in the test fold was never seen during training; that simply counts toward the validation error. It is usually recommended to repeat a 10-fold cross-validation 50-100 times for stability.
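
A minimal sketch of that repeated, label-blind 10-fold procedure (the names X and y and the choice of fitcecoc are assumptions for illustration only):

    k = 10;
    nRepeats = 50;                               % 50-100 repeats for a stable estimate
    repAcc = zeros(nRepeats, 1);
    for r = 1:nRepeats
        c = cvpartition(numel(y), 'KFold', k);   % fresh random partition, labels ignored
        foldAcc = zeros(k, 1);
        for i = 1:k
            trainIdx = training(c, i);
            testIdx  = test(c, i);
            mdl  = fitcecoc(X(trainIdx, :), y(trainIdx));
            pred = predict(mdl, X(testIdx, :));
            foldAcc(i) = mean(pred == y(testIdx));
        end
        repAcc(r) = mean(foldAcc);               % CV accuracy of this repeat
    end
    estimate  = mean(repAcc);                    % overall accuracy estimate
    stability = std(repAcc);                     % spread across repeats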

You may try to avoid the class-imbalance issue (and thus indirectly reduce the odds of the excluded-class event mentioned above), but if your data really suffers from this problem, there are several re-sampling strategies in my previous answer in this post.
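
Purely as a generic illustration of one such re-sampling strategy (random oversampling of the smaller classes; the names XTrain and yTrain are hypothetical, and this must be applied to the training fold only, never to the validation fold):

    % Duplicate training samples of the smaller classes until every class
    % matches the largest one (numeric column-vector labels assumed).
    classes = unique(yTrain);
    counts  = arrayfun(@(cl) sum(yTrain == cl), classes);
    target  = max(counts);
    Xbal = XTrain;  ybal = yTrain;
    for ci = 1:numel(classes)
        idx   = find(yTrain == classes(ci));
        extra = idx(randi(numel(idx), target - numel(idx), 1));   % sample with replacement
        Xbal  = [Xbal; XTrain(extra, :)];
        ybal  = [ybal; yTrain(extra)];
    end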