You have indeed correctly described the way to work with crossvalidation. In fact, you are 'lucky' to have a reasonable validation set at the end, because often, crossvalidation is used to optimize a model, but no "real" validation is done.
As @Simon Stelling said in his comment, crossvalidation will lead to lower estimated errors (which makes sense, because you are constantly reusing the data), but fortunately this is the case for all models, so, barring catastrophe (i.e.: errors are only reduced slightly for a "bad" model, and more for "the good" model), selecting the model that performs best on a crossvalidated criterion will typically also be the best "for real".
A method that is sometimes used to correct somewhat for the lower errors, especially if you are looking for parsimonious models, is to select the smallest model/simplest method for which the crossvalidated error is within one SD of the (crossvalidated) optimum. Like crossvalidation itself, this is a heuristic, so it should be used with some care (if this is an option: make a plot of your errors against your tuning parameters; this will give you some idea of whether you have acceptable results).
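For illustration, a minimal MATLAB sketch of this one-SD rule might look as follows, assuming (hypothetical) vectors cv_err and cv_sd that hold the crossvalidated error and its standard deviation for models ordered from simplest to most complex:

% Hypothetical inputs: cv_err(m) and cv_sd(m) for models m = 1 (simplest) ... M.
[best_err, best_m] = min(cv_err);         % crossvalidated optimum
threshold = best_err + cv_sd(best_m);     % optimum plus one SD
chosen_m = find(cv_err <= threshold, 1);  % simplest model within one SD of the optimum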
Given the downward bias of the errors, it is important not to publish the errors or other performance measures from the crossvalidation without mentioning that they come from crossvalidation (although, truth be told: I have seen too many publications that don't mention that the performance measure was obtained from checking the performance on the original dataset either --- so mentioning crossvalidation actually makes your results worth more). For you, this will not be an issue, since you have a validation set.
A final warning: if your model fitting results in some close competitors, it is a good idea to look at their performance on your validation set afterwards, but do not base your final model selection on that: you can at best use this to soothe your conscience, but your "final" model must have been picked before you ever look at the validation set.
With regard to your second question: I believe Simon has given you all the answers you need in his comment, but to complete the picture: as so often, it is the bias-variance trade-off that comes into play. If you know that, on average, you will reach the correct result (unbiasedness), the price is typically that each of your individual calculations may lie pretty far from it (high variance). In the old days, unbiasedness was the nec plus ultra; nowadays, one at times accepts a (small) bias (so you don't even know that the average of your calculations will give the correct result) if it results in lower variance. Experience has shown that the balance is acceptable with 10-fold crossvalidation. For you, the bias would only be an issue for your model optimization, since you can afterwards estimate the criterion (unbiasedly) on the validation set. As such, there is little reason not to use crossvalidation.
The purpose of the outer cross-validation (CV) is to get an estimate of the classifier's performance on genuinely unseen data. If the hyperparameters are tuned based on a cross-validation statistic, this can lead to a biased performance estimate, so an outer loop that was not involved in any aspect of feature or model selection is needed to determine the performance estimate.
Conversely, if you do not tune the hyperparameters (and use the default hyperparameters in svmtrain and svmclassify), you do not need an outer cross-validation loop.
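In that case, a single-loop CV is enough; a minimal sketch (assuming the Data and Labels variables described below, and MATLAB's default SVM settings) could look like this:

% Minimal sketch: plain 10-fold CV with default SVM hyperparameters,
% so no inner tuning loop is needed.
c = cvpartition(Labels,'k',10);
err = zeros(c.NumTestSets,1);
for i = 1:c.NumTestSets
    svmStruct = svmtrain(Data(training(c,i),:), Labels(training(c,i)), 'Kernel_Function','rbf');
    class = svmclassify(svmStruct, Data(test(c,i),:));
    err(i) = mean(class ~= Labels(test(c,i)));
end
fprintf('Crossvalidated error: %2.2f%%\n', 100*mean(err));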
Here is an example of some code that implements nested CV. This implementation uses Nelder-Mead optimization (NMO) and sequential forward feature selection in the inner loop to find the optimum feature set and hyperparameters (box constraint C and RBF sigma).
Data are the data to be classified (dimension: cases x features); Labels are the class labels for each case.
%************** Nested cross-validation ******************
num_folds = 10; % number of inner CV folds (value assumed; set as appropriate)
Results = classperf(Labels, 'Positive', 1, 'Negative', 0); % initialize the classifier performance object
for i = 1:length(Labels) % outer loop: leave-one-out over all cases
    test = zeros(size(Labels));
    test(i) = 1; test = logical(test); train = ~test;
    fprintf('Fold: %d of %d.\n', i, length(Labels));
    %************** Perform feature selection ************
    z0 = [0,0]; % z = log([rbf_sigma, boxconstraint]); exp(z0) = [1,1], the defaults
    [rbf_sigma_Acc(i), boxconstraint_Acc(i), maxAcc, Features{i}] = SVM_NMO(z0, Data(train,:), Labels(train), num_folds);
    %***************** Outer loop CV *********************
    svmStruct = svmtrain(Data(train,Features{i}), Labels(train), 'Kernel_Function','rbf', 'rbf_sigma',rbf_sigma_Acc(i), 'boxconstraint',boxconstraint_Acc(i));
    class = svmclassify(svmStruct, Data(test,Features{i}));
    classperf(Results, class, test); % update the CP object with the current classification results
    Acc_fold(i) = Results.LastCorrectRate;
    fprintf('Test set Accuracy (Fold %d): %2.2f\n', i, 100*Acc_fold(i));
    fprintf('Test set Accuracy (running mean): %2.2f\n\n', 100*Results.CorrectRate);
end
function [rbf_sigma, boxconstraint, Acc, Features_opt] = SVM_NMO(z0, Data, Labels, num_folds)
opts = optimset('TolX',1e-1,'TolFun',1e-1);
fun = @(z)SVM_min_fn(Data, Labels, exp(z(1)), exp(z(2)), num_folds); % optimize in log-space so both parameters stay positive
[z_opt, Crit] = fminsearch(fun, z0, opts); % Nelder-Mead search over [log(rbf_sigma), log(boxconstraint)]
[~, Features_opt] = fun(z_opt); % re-run the selection at the optimum to recover the feature set
%************ Get optimal results **************
Acc = 1 - Crit; % accuracy of the model (Crit is the misclassification rate)
rbf_sigma = exp(z_opt(1));
boxconstraint = exp(z_opt(2));
fprintf('Max Acc: %2.2f, RBF sigma: %1.2f. Boxconstraint: %1.2f\n', Acc, rbf_sigma, boxconstraint);
function [Crit, Features] = SVM_min_fn(Data, Labels, rbf_sigma, boxconstraint, num_folds)
direction = 'forward';
kernel = 'rbf';
fprintf('RBF sigma: %1.4f. Boxconstraint: %1.4f\n', rbf_sigma, boxconstraint);
c = cvpartition(Labels,'k',num_folds); % inner CV partition used by the feature selection
opts = statset('display','iter','TolFun',1e-3);
fun = @(x_train,y_train,x_test,y_test)SVM_class_fun(x_train,y_train,x_test,y_test,kernel,rbf_sigma,boxconstraint);
[fs,history] = sequentialfs(fun, Data, Labels, 'cv',c, 'direction',direction, 'options',opts);
Features = find(fs); % features selected for the given sigma and C
Crit = min(history.Crit); % lowest mean classification error along the selection path
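SVM_class_fun is not shown above; a minimal sketch of the criterion function that sequentialfs expects (it should return the number of misclassified test cases for a candidate feature subset) might look like this:

function err = SVM_class_fun(x_train, y_train, x_test, y_test, kernel, rbf_sigma, boxconstraint)
% Criterion for sequentialfs: train an SVM on the candidate feature subset
% and return the number of misclassified test cases (sequentialfs averages
% these counts over the CV folds).
svmStruct = svmtrain(x_train, y_train, 'Kernel_Function',kernel, 'rbf_sigma',rbf_sigma, 'boxconstraint',boxconstraint);
err = sum(svmclassify(svmStruct, x_test) ~= y_test);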
Hope this helps
The cross-validation method LeaveMOut is a common strategy. In fact, when modelling a specific classifier, LeaveMOut allows you to create training and testing data easily. As this procedure is repeated several times with random splits, you average the performance.
However, LeaveMOut is a kind of k-fold cross-validation where (k-1) folds are used for training and the remaining fold is used for testing.
There is no big difference. Maybe the only difference is that LeaveMOut does not allow for validation data, as it only leaves M samples out of the training data. When it is said that using LeaveMOut in a loop does not guarantee disjoint evaluation sets, it means that the same samples may appear in the test set in different iterations, and this may be problematic for some applications (not in general, in my opinion).
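For illustration, a repeated LeaveMOut run with crossvalind (Bioinformatics Toolbox) might look like the sketch below, assuming the same Data/Labels layout as above; note that the test sets of different repetitions are not guaranteed to be disjoint:

N = length(Labels); M = 10; reps = 100; % assumed sizes for the sketch
err = zeros(reps,1);
for r = 1:reps
    [train, test] = crossvalind('LeaveMOut', N, M); % random split, M cases held out
    svmStruct = svmtrain(Data(train,:), Labels(train), 'Kernel_Function','rbf');
    err(r) = mean(svmclassify(svmStruct, Data(test,:)) ~= Labels(test));
end
fprintf('Mean LeaveMOut error: %2.2f%%\n', 100*mean(err));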