Solved – Grid search for SVM parameters; is this really how it is done?

cross-validation, feature-selection, svm

Suppose I use nested 10-fold cross-validation with an SVM, so the inner-most loop will run 100 times. Now suppose I use a Gaussian radial basis function (RBF) kernel, which needs the scale parameter sigma, and that I also need to find the optimal C parameter for the SVM (it is called 'boxconstraint' in Matlab). To find the optimal (sigma, C) pair on, say, a 100 x 100 grid, I would need to train 100 * 100 = 10,000 SVMs for each inner fold. So in the end there would be about one million trained SVMs (performance estimation + parameter selection). Is this really how it is done?
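For concreteness, here is a minimal sketch of the nested set-up I mean, written with scikit-learn in Python rather than Matlab purely as an illustration (the data set, grid sizes and library calls are placeholder assumptions, not my actual code); the comment at the end shows where the fit counts come from:

```python
# Minimal sketch of nested cross-validation with a grid search over (C, gamma).
# Everything here (scikit-learn, the toy data, the 10 x 10 grid) is illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 10 x 10 grid of candidate hyper-parameters (the question assumes 100 x 100).
param_grid = {"C": [2.0 ** k for k in range(-4, 6)],
              "gamma": [2.0 ** k for k in range(-7, 3)]}

# Inner loop: 10-fold grid search selects (C, gamma) on each outer training set.
inner = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)

# Outer loop: 10-fold estimate of the performance of the whole selection procedure.
scores = cross_val_score(inner, X, y, cv=10)
print("nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Number of SVM fits: outer folds * inner folds * grid points (plus refits),
# i.e. 10 * 10 * (10 * 10) = 10,000 here, or about 1,000,000 for a 100 x 100 grid.
```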

What if I add feature selection? I might then have to train and test 10 million SVMs. Should I apply for time on a supercomputer? What is the limit of an ordinary 8-core machine for this kind of selection?

Best Answer

Yes, SVM hyper-parameters are often tuned by grid search, and it is a pretty reasonable way to go about model selection provided you only have a few hyper-parameters to optimise. It isn't quite as expensive as the question suggests, because if you change the hyper-parameters slightly you don't have to start the training procedure (e.g. SMO) from scratch each time; instead you can start from the optimal solution for the previous set of hyper-parameters (a "warm start"). Also, the kernel matrix depends only on the RBF scale parameter and not on C, so if you lay out the grid so that each row represents a different value of C and each column a different value of the RBF scale parameter, and you then optimise down each column in turn, you can keep the cached evaluations of the kernel function and only need to flush the cache at the start of each new column. Alternatively, you can use an algorithm that evaluates the whole "regularisation path" (i.e. all values of C for a fixed kernel, a whole column of the grid) in one go.
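As a rough illustration of the kernel-caching point, here is a minimal sketch assuming scikit-learn (not my own code); it only shows the reuse of the Gram matrix across values of C via a precomputed kernel, and does not expose warm-starting of the SMO solver itself:

```python
# Sketch: the Gram matrix depends only on the RBF scale, so compute it once per
# gamma value (one "column" of the grid) and reuse it while sweeping C.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best = (-np.inf, None)
for gamma in [2.0 ** k for k in range(-7, 3)]:       # new column: recompute the kernel
    K_tr = rbf_kernel(X_tr, X_tr, gamma=gamma)       # kernel evaluated once per gamma
    K_val = rbf_kernel(X_val, X_tr, gamma=gamma)
    for C in [2.0 ** k for k in range(-4, 6)]:       # sweep C with the kernel fixed
        acc = SVC(kernel="precomputed", C=C).fit(K_tr, y_tr).score(K_val, y_val)
        if acc > best[0]:
            best = (acc, (C, gamma))
print("best (C, gamma):", best[1], "validation accuracy:", best[0])
```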

Grid search becomes impractical for problems with a large number of hyper-parameters, so in that case I use the Nelder-Mead simplex algorithm, which can minimise the model-selection criterion without needing gradient information and works pretty well. Alternatively you can use gradient-based optimisation (and optimising a bound on the leave-one-out cross-validation error is also a good alternative to regular cross-validation).
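For example, a minimal sketch of Nelder-Mead tuning of (C, gamma) in log-space, assuming scipy and scikit-learn (the data set, starting point and cross-validation objective are illustrative choices, not a definitive recipe):

```python
# Sketch: tune (C, gamma) with the Nelder-Mead simplex method instead of a grid,
# searching in log-space and minimising the 10-fold cross-validation error.
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

def cv_error(log_params):
    C, gamma = np.exp(log_params)                    # optimise in log-space
    model = SVC(kernel="rbf", C=C, gamma=gamma)
    return 1.0 - cross_val_score(model, X, y, cv=10).mean()

res = minimize(cv_error, x0=np.log([1.0, 0.1]), method="Nelder-Mead")
print("best (C, gamma):", np.exp(res.x), "CV error:", res.fun)
```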

I would generally recommend against feature selection for SVMs (unless identifying the relevant features is of intrinsic interest) as it often makes performance worse rather than better. The SVM is an approximate implementation of a bound on generalisation performance which is independent of dimensionality, so there is good reason to think they will work well without the need for feature selection (provided C is carefully tuned).

I do this kind of thing all the time, and it is computationally expensive. However, the key to using kernel methods correctly lies in rigorous tuning of the kernel and regularisation hyper-parameters, so the expense is well justified; it is simply what is required for "best practice".
