Solved – Error and Dispersion meaning in tune.out for SVM Classifier

Tags: cross-validation, e1071, svm

I am using an SVM to solve a binary classification problem with a qualitative response as output.

To find the best parameters for the SVM, I used 10-fold cross-validation. The result (in R, under RStudio) was:

Parameter tuning of ‘svm’:

- sampling method: 10-fold cross validation 

- best parameters:
 cost
    5

- best performance: 0.25 

- Detailed performance results:
   cost     error dispersion
1 1e-03 0.4833333  0.2415229
2 1e-02 0.4833333  0.2415229
3 1e-01 0.3500000  0.1657382
4 1e+00 0.2666667  0.1405457
5 5e+00 0.2500000  0.1416394
6 1e+01 0.2666667  0.1791613
7 1e+02 0.2666667  0.1791613

What I would like to know is what the error and dispersion columns mean, and how they are calculated.

My attempt at an answer: is the error the average MSE of the ten test-error estimates? I think not, because the classification problem has a qualitative response, so the CV error rate should be computed from the misclassified observations.

I am a bit confused about this.

Best Answer

If you dig into the code of tune, you'll find that it calculates error for each of the surrogate models, and then aggregates these per-model error estimates into a point estimate (that is reported in your summary as error) and dispersion.

For classification, the surrogate-model error estimate is the fraction of misclassified observations among all predictions, i.e. 1 - accuracy (for a numeric response it is the mean squared error).
The aggregation function for the point estimate is tunecontrol$sampling.aggregate, which defaults to mean.
The aggregation function for the dispersion is tunecontrol$sampling.dispersion, which defaults to sd.

See also the man page of tune.control().
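
To make the two columns concrete, here is a minimal Python sketch (not the e1071 source) of the computation described above: a misclassification rate for each held-out fold, then the mean and the sample standard deviation across folds. The fold data and the function names are made up for illustration.

```python
from statistics import mean, stdev


def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels (1 - accuracy)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def aggregate_folds(fold_errors):
    """Point estimate (mean) and dispersion (sample sd) of per-fold errors."""
    return mean(fold_errors), stdev(fold_errors)


# Hypothetical held-out results: (true labels, predicted labels) for 4 folds.
folds = [
    (["a", "a", "b", "b"], ["a", "b", "b", "b"]),  # 1 of 4 wrong
    (["a", "b", "a", "b"], ["a", "b", "a", "b"]),  # 0 of 4 wrong
    (["b", "b", "a", "a"], ["a", "a", "a", "a"]),  # 2 of 4 wrong
    (["a", "b", "b", "a"], ["a", "b", "a", "a"]),  # 1 of 4 wrong
]

fold_errors = [misclassification_rate(t, p) for t, p in folds]
error, dispersion = aggregate_folds(fold_errors)
print(f"error = {error:.4f}, dispersion = {dispersion:.4f}")
```

With equal-sized folds, the mean of the per-fold rates equals the overall misclassification rate. The dispersion only describes how much the per-fold estimates vary around that mean.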

Related Solutions

Solved – SVM parameter selection and cross validation

There is a recently proposed method to speed up grid search: "Fast Cross validation via sequential analysis"

http://www.scribd.com/doc/76134034/Fast-Cross-Validation-Via-Sequential-Analysis-Talk

Basically, they do a normal grid search, but try to eliminate bad parameter values early in the process so that little computation is wasted on them. It's fairly new and I don't know of any independent evaluations of the method, but I'm currently implementing it and want to give it a try.
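
The general idea of dropping weak candidates early can be sketched as follows. This is a deliberately simplified stand-in, not the sequential-testing procedure from the talk: it discards any candidate whose running mean error trails the current best by more than a fixed margin. The function names, the margin, the warm-up length, and the toy error table are all assumptions made for illustration.

```python
def eliminate_early(candidates, fold_error, n_folds, margin=0.05, warmup=2):
    """Run CV fold by fold and drop candidates that fall clearly behind.

    candidates: list of parameter values
    fold_error: callable (param, fold_index) -> error on that fold
    Returns (best parameter, mean errors of the survivors, evaluations used).
    """
    errors = {c: [] for c in candidates}
    active = list(candidates)
    for k in range(n_folds):
        for c in active:
            errors[c].append(fold_error(c, k))
        if k + 1 >= warmup:
            running = {c: sum(errors[c]) / len(errors[c]) for c in active}
            best = min(running.values())
            # Keep only candidates within `margin` of the current best.
            active = [c for c in active if running[c] <= best + margin]
    final = {c: sum(errors[c]) / len(errors[c]) for c in active}
    best_param = min(final, key=final.get)
    evaluations = sum(len(v) for v in errors.values())
    return best_param, final, evaluations


# Toy stand-in for "train and test an SVM on fold k": a fixed base error per
# cost value plus a small deterministic per-fold wobble (hypothetical numbers).
BASE = {0.001: 0.48, 0.1: 0.35, 1.0: 0.27, 5.0: 0.25, 10.0: 0.27}


def toy_fold_error(cost, k):
    return BASE[cost] + 0.02 * ((k % 3) - 1)


best_param, final, evaluations = eliminate_early(
    list(BASE), toy_fold_error, n_folds=10
)
print(best_param, evaluations, "of", len(BASE) * 10, "possible evaluations")
```

A fixed margin is crude: it ignores how noisy the per-fold estimates are, which is exactly what a proper sequential test accounts for.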

Solved – Nested cross-validation for classification in MATLAB

The purpose of the outer cross-validation (CV) loop is to estimate the classifier's performance on genuinely unseen data. If the hyperparameters are tuned on a cross-validation statistic, that same statistic gives a biased performance estimate, so an outer loop that was not involved in any aspect of feature or model selection is needed to estimate performance. Conversely, if you do not tune the hyperparameters (and use the default hyperparameters in svmtrain and svmclassify), you do not need an outer cross-validation loop.
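
The division of labour between the two loops can be sketched in plain Python. This is only a structural illustration under stated assumptions: a toy 1-D nearest-neighbour classifier stands in for the SVM, the number of neighbours stands in for the hyperparameters, and all names and data are hypothetical.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit(xs, ys, k):
    """'Training' a nearest-neighbour model just stores the data and k."""
    return xs, ys, k


def predict(model, x):
    """Majority vote among the k nearest training points (binary 0/1 labels)."""
    xs, ys, k = model
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
    return int(2 * sum(ys[i] for i in nearest) > k)


def error_rate(model, X, y, idx):
    """Misclassification rate of `model` on the observations in `idx`."""
    return sum(predict(model, X[i]) != y[i] for i in idx) / len(idx)


def nested_cv(X, y, param_grid, outer_k=5, inner_k=3):
    outer_errors = []
    for test_idx in k_fold_indices(len(y), outer_k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]

        def inner_error(param):
            # Inner loop: uses the outer training part only.
            errs = []
            for val_pos in k_fold_indices(len(train_idx), inner_k):
                val_idx = [train_idx[j] for j in val_pos]
                val = set(val_idx)
                fit_idx = [i for i in train_idx if i not in val]
                model = fit([X[i] for i in fit_idx], [y[i] for i in fit_idx], param)
                errs.append(error_rate(model, X, y, val_idx))
            return sum(errs) / len(errs)

        best = min(param_grid, key=inner_error)  # tuning never sees test_idx
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], best)
        outer_errors.append(error_rate(model, X, y, test_idx))
    return outer_errors


# Hypothetical data: 20 shuffled points on a line, class 1 if x >= 10.
X = [(i * 7) % 20 for i in range(20)]
y = [int(x >= 10) for x in X]
outer_errors = nested_cv(X, y, param_grid=[1, 3, 5])
print(outer_errors, sum(outer_errors) / len(outer_errors))
```

The outer test fold is touched only once per iteration, after the hyperparameter has been fixed, which is what keeps the averaged outer error free of selection bias.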

Here is some example code that implements nested CV. It uses Nelder-Mead optimization (NMO) and sequential forward feature selection in the inner loop to find the optimum feature set and hyperparameters (box constraint C and RBF sigma).

Data are the data to be classified (Dimension: Cases x Features)

Labels are the class labels for each case

%************** Nested cross-validation ******************
Results = classperf(Labels, 'Positive', 1, 'Negative', 0);      % Initialize the classifier performance object
num_folds = 10;      % Number of inner CV folds (assumed value; not defined in the original post)
for i = 1:length(Labels)
    test = zeros(size(Labels));
    test(i) = 1; test = logical(test); train = ~test;
    disp(sprintf('Fold: %d of %d.\n',i,length(Labels)))

    %************** Perform feature selection ************
    z0 = [0,0];    % z = log([rbf_sigma,boxconstraint]); z0 = [0,0] gives exp(z0) = [1,1], the defaults
    [rbf_sigma_Acc(i) boxconstraint_Acc(i) maxAcc Features{i}] = SVM_NMO(z0,Data(train,:),Labels(train),num_folds);

    %***************** Outer loop CV *********************
    svmStruct = svmtrain(Data(train,Features{i}),Labels(train),'Kernel_Function','rbf','rbf_sigma',rbf_sigma_Acc(i),'boxconstraint',boxconstraint_Acc(i));
    class = svmclassify(svmStruct,Data(test,Features{i}));
    classperf(Results,class,test);    % updates the CP object with the current classification results
    Acc_fold(i) = Results.LastCorrectRate;    
    disp(sprintf('Test set Accuracy (Fold %d): %2.2f',i,Acc_fold(i)))
    disp(sprintf('Test set Accuracy (running mean): %2.2f\n',100*Results.CorrectRate))
end

function [rbf_sigma boxconstraint Acc Features_opt] = SVM_NMO(z0,Data,Labels,num_folds)
opts = optimset('TolX',1e-1,'TolFun',1e-1);
fun = @(z)SVM_min_fn(Data,Labels,exp(z(1)),exp(z(2)),num_folds);
[z_opt,Crit] = fminsearch(fun,z0,opts);
[~, Features_opt] = fun(z_opt);

%************ Get optimal results **************
Acc = 1 - Crit;                       % Accuracy for model  
rbf_sigma = exp(z_opt(1));
boxconstraint = exp(z_opt(2));
disp(sprintf('Max Acc: %2.2f, RBF sigma: %1.2f. Boxconstraint: %1.2f',Acc,rbf_sigma,boxconstraint))


function [Crit Features] = SVM_min_fn(Data,Labels,rbf_sigma,boxconstraint,num_folds)
direction = 'forward';
kernel = 'rbf';

disp(sprintf('RBF sigma: %1.4f. Boxconstraint: %1.4f',rbf_sigma,boxconstraint))
c = cvpartition(Labels,'k',num_folds);
opts = statset('display','iter','TolFun',1e-3);
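% SVM_class_fun is not shown in the original post; sequentialfs expects it to return
% a scalar criterion (e.g. the number of misclassified test cases) for one train/test split.
fun = @(x_train,y_train,x_test,y_test)SVM_class_fun(x_train,y_train,x_test,y_test,kernel,rbf_sigma,boxconstraint);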
[fs,history] = sequentialfs(fun,Data,Labels,'cv',c,'direction',direction,'options',opts);

Features = find(fs==1);        % Features selected for given sigma and C
[Crit,h] = min(history.Crit);  % Mean classification error

Hope this helps
