The purpose of the outer cross-validation (CV) is to get an estimate of the classifier's performance on genuinely unseen data. If the hyperparameters are tuned based on a cross-validation statistic, the resulting performance estimate is biased, so an outer loop that was not involved in any aspect of feature or model selection is needed to obtain an honest estimate of performance.
Conversely, if you do not tune the hyperparameters (and use the default hyperparameters in svmtrain and svmclassify), you do not need an outer cross-validation loop.
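In that case a single k-fold cross-validation run already gives an unbiased performance estimate. A minimal sketch (the use of crossval here is my illustration with the default svmtrain settings, not part of the original code):
c = cvpartition(Labels,'k',10);
predfun = @(xtr,ytr,xte) svmclassify(svmtrain(xtr,ytr),xte); % default hyperparameters throughout
err = crossval('mcr',Data,Labels,'Predfun',predfun,'Partition',c); % misclassification rate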
Here is an example of some code that implements nested CV. This implementation uses Nelder-Mead optimization (NMO) and sequential forward feature selection in the inner loop to find the optimum feature set and hyperparameters (box constraint C and RBF sigma).
Data is the matrix of data to be classified (dimension: cases x features) and Labels is the vector of class labels for each case.
%************** Nested cross-validation ******************
Results = classperf(Labels, 'Positive', 1, 'Negative', 0); % Initialize the classifier performance object
num_folds = 10; % number of inner CV folds (a value is assumed here; the original listing did not define it)
for i = 1:length(Labels)
test = zeros(size(Labels));
test(i) = 1; test = logical(test); train = ~test;
fprintf('Fold: %d of %d.\n',i,length(Labels))
%************** Perform feature selection ************
z0 = [0,0]; % z = [log(rbf_sigma), log(boxconstraint)]; exp(z0) = [1,1] gives the default values
[rbf_sigma_Acc(i), boxconstraint_Acc(i), maxAcc, Features{i}] = SVM_NMO(z0,Data(train,:),Labels(train),num_folds);
%***************** Outer loop CV *********************
svmStruct = svmtrain(Data(train,Features{i}),Labels(train),'Kernel_Function','rbf','rbf_sigma',rbf_sigma_Acc(i),'boxconstraint',boxconstraint_Acc(i));
class = svmclassify(svmStruct,Data(test,Features{i}));
classperf(Results,class,test); % update the CP object with the current classification results
Acc_fold(i) = Results.LastCorrectRate;
fprintf('Test set accuracy (Fold %d): %2.2f\n',i,100*Acc_fold(i))
fprintf('Test set accuracy (running mean): %2.2f\n\n',100*Results.CorrectRate)
end
function [rbf_sigma, boxconstraint, Acc, Features_opt] = SVM_NMO(z0,Data,Labels,num_folds)
opts = optimset('TolX',1e-1,'TolFun',1e-1);
fun = @(z)SVM_min_fn(Data,Labels,exp(z(1)),exp(z(2)),num_folds);
[z_opt,Crit] = fminsearch(fun,z0,opts);
[~, Features_opt] = fun(z_opt);
%************ Get optimal results **************
Acc = 1 - Crit; % Accuracy for model
rbf_sigma = exp(z_opt(1));
boxconstraint = exp(z_opt(2));
fprintf('Max Acc: %2.2f, RBF sigma: %1.2f. Boxconstraint: %1.2f\n',Acc,rbf_sigma,boxconstraint)
function [Crit, Features] = SVM_min_fn(Data,Labels,rbf_sigma,boxconstraint,num_folds)
direction = 'forward';
kernel = 'rbf';
fprintf('RBF sigma: %1.4f. Boxconstraint: %1.4f\n',rbf_sigma,boxconstraint)
c = cvpartition(Labels,'k',num_folds);
opts = statset('display','iter','TolFun',1e-3);
fun = @(x_train,y_train,x_test,y_test)SVM_class_fun(x_train,y_train,x_test,y_test,kernel,rbf_sigma,boxconstraint);
[fs,history] = sequentialfs(fun,Data,Labels,'cv',c,'direction',direction,'options',opts);
Features = find(fs==1); % Features selected for given sigma and C
Crit = min(history.Crit); % best mean classification error over the selection steps
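Note that SVM_class_fun is called by sequentialfs above but was not included in the listing. sequentialfs expects the criterion function to return the number of misclassified test observations (it sums these over the folds and divides by the total number of test observations to get the mean error). A minimal sketch of what it presumably looks like:
function Err = SVM_class_fun(x_train,y_train,x_test,y_test,kernel,rbf_sigma,boxconstraint)
% Criterion for sequentialfs: the number of misclassified test cases
svmStruct = svmtrain(x_train,y_train,'Kernel_Function',kernel,'rbf_sigma',rbf_sigma,'boxconstraint',boxconstraint);
class = svmclassify(svmStruct,x_test);
Err = sum(class ~= y_test);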
Hope this helps
Do the previously mentioned steps follow any standard procedure?
Yes! You are using a hold-out validation set for the final classifier comparison and k-fold cross-validation for parameter (model) selection.
If not, how can I use Repeated/Nested CV in my case?
Since you are considering different models, one way to improve on that would be:
For each method:
- Use k-fold cross-validation for model selection
- After selecting the optimal parameters (model fitting), use k-fold cross-validation to get the generalisation error.
This gives you the variation in errors across the different folds, so you can calculate the variance (or standard deviation) to report on the reliability/consistency of the model, or even generate some plots.
UPDATE
You don't need to split the data separately for step 1 and step 2. Use all 10000 data points in k-fold cross-validation, i.e., if k = 10, then you will use 9000 for training and 1000 for validation during model selection. Once the model is selected, use the same 10000 samples in a similar k-fold cross-validation, but this time with the parameters fixed.
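A sketch of the two-step procedure (the candidate grid and all variable names here are illustrative, not from the original post):
k = 10;
c = cvpartition(Labels,'k',k);
sigmas = 2.^(-2:2); Cs = 2.^(-2:2); % assumed candidate hyperparameter grid
bestErr = Inf;
for s = sigmas
    for C = Cs
        nErr = 0;
        for f = 1:k % Step 1: k-fold CV for model selection
            tr = training(c,f); te = test(c,f);
            m = svmtrain(Data(tr,:),Labels(tr),'Kernel_Function','rbf','rbf_sigma',s,'boxconstraint',C);
            nErr = nErr + sum(svmclassify(m,Data(te,:)) ~= Labels(te));
        end
        if nErr < bestErr, bestErr = nErr; best = [s C]; end
    end
end
foldErr = zeros(1,k);
for f = 1:k % Step 2: same samples, k-fold CV with the selected parameters fixed
    tr = training(c,f); te = test(c,f);
    m = svmtrain(Data(tr,:),Labels(tr),'Kernel_Function','rbf','rbf_sigma',best(1),'boxconstraint',best(2));
    foldErr(f) = mean(svmclassify(m,Data(te,:)) ~= Labels(te));
end
fprintf('Mean error: %.3f, std: %.3f\n',mean(foldErr),std(foldErr))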
You can choose to run k-fold cross-validation once and get k error measures, one for each held-out subset (2*k if you also consider the training-set errors, which you could also look into). With those k or 2*k values you can perform some statistical tests or draw some plots. It is also good to repeat the cross-validation process n times, giving you n*k error measures for statistical analysis.
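Repeating this is straightforward; a sketch (again illustrative, with rbf_sigma and boxconstraint assumed to have been selected already):
n = 10; k = 10;
errs = zeros(n,k);
for r = 1:n
    c = cvpartition(Labels,'k',k); % a fresh random partition for each repetition
    for f = 1:k
        tr = training(c,f); te = test(c,f);
        m = svmtrain(Data(tr,:),Labels(tr),'Kernel_Function','rbf','rbf_sigma',rbf_sigma,'boxconstraint',boxconstraint);
        errs(r,f) = mean(svmclassify(m,Data(te,:)) ~= Labels(te));
    end
end
boxplot(errs(:)) % n*k error values for plots or statistical tests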
Best Answer
Yes, SVM hyper-parameters are often tuned using grid search, and it is a pretty reasonable way to go about model selection provided you only have a few hyper-parameters to optimise. It isn't quite as expensive as the question suggests, because if you change the hyper-parameters slightly you don't have to start the training procedure (e.g. SMO) from scratch each time; instead you can start from the optimal solution for the last set of hyper-parameters you looked at (a "warm start"). If you arrange the grid so that each row represents a different value of C and each column a different value of the RBF scale parameter, then if you optimise each column in turn (so the kernel parameter is held fixed while C varies), you can keep the cached evaluations of the kernel function and only need to flush the cache at the start of each column. Alternatively you can use an algorithm that evaluates the whole "regularisation path" (i.e. a whole column of the grid) in one go.
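A structural sketch of that column-wise sweep. svmtrain exposes neither warm starts nor an external kernel cache, so a kernel regularised least-squares (LS-SVM-style) solve is used below purely to illustrate reusing the Gram matrix within a column (it assumes numeric +/-1 labels; all names are illustrative):
D2 = pdist2(Data,Data).^2; % pairwise squared distances: computed once for the whole grid
sigmas = 2.^(-3:3); Cs = 2.^(-3:3);
for j = 1:numel(sigmas) % one column of the grid = one kernel parameter
    K = exp(-D2/(2*sigmas(j)^2)); % the kernel "cache" for this column
    for i = 1:numel(Cs) % vary C with the kernel matrix fixed
        alpha = (K + eye(size(K,1))/Cs(i)) \ Labels(:); % reuses K; only the regularisation changes
    end
end % the cache is only flushed when the next column recomputes K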
Grid search is impractical for problems with a large number of hyper-parameters, so I use the Nelder-Mead simplex algorithm, which can minimise a function without evaluating gradient information and works pretty well. Alternatively you can use gradient descent optimisation (optimising a bound on the leave-one-out cross-validation error is also a good alternative to regular cross-validation).
I would generally recommend against feature selection for SVMs (unless identifying the relevant features is of intrinsic interest) as it often makes performance worse rather than better. The SVM is an approximate implementation of a bound on generalisation performance which is independent of dimensionality, so there is good reason to think they will work well without the need for feature selection (provided C is carefully tuned).
I do this kind of thing all the time, and it is computationally expensive. However, the key to using kernel methods correctly lies in rigorous tuning of the kernel and hyper-parameters, so the expense is well justified as it is what is required for "best practice".