MATLAB: Does sequentialfs always outperform cross-validation with selected features?

Bioinformatics Toolbox, cross-validation, sequentialfs, Statistics and Machine Learning Toolbox, svm, svmclassify

Why does classification accuracy obtained using sequentialfs and cross-validation always outperform a 10-fold cross-validation using those selected features? Any help would be gratefully received!

Thanks in advance.

Barry

See the code below. Acc_fs (77%) is always higher than Acc (67%). This holds across multiple tests: the accuracy obtained using sequentialfs always exceeds the cross-validated accuracy. Is this a bug in my implementation or an issue with sequentialfs.m?

%************** Perform feature selection ************
c = cvpartition(Labels,'k',num_folds);
opts = statset('display','iter');
fun = @(x_train,y_train,x_test,y_test)SVM_class_fun(x_train,y_train,x_test,y_test,kernel,rbf_sigma,boxconstraint);
[fs,history] = sequentialfs(fun,Data,Labels,'cv',c,'options',opts);
Acc_fs = 1 - history.Crit(end);
%******* Cross validated classification accuracy *******
Feature_select = find(fs==1);       % Features selected
Vars_select = Variables(fs==1);       % Variable names of features selected
indices = crossvalind('Kfold',Labels,num_folds);
Results = classperf(Labels, 'Positive', 1, 'Negative', 0);      % Initialize 
for i = 1:num_folds
    test = (indices == i); train = ~test;
    svmStruct = svmtrain(Data(train,Feature_select),Labels(train),'Kernel_Function','rbf','rbf_sigma',rbf_sigma,'boxconstraint',boxconstraint);      
    predicted = svmclassify(svmStruct,Data(test,Feature_select));   % avoid shadowing the built-in class()
    classperf(Results,predicted,test);
end
Acc = Results.CorrectRate;      % Classification accuracy

Function SVM_class_fun returns number of misclassified samples:

function MCE = SVM_class_fun(x_train,y_train,x_test,y_test,kernel,rbf_sigma,boxconstraint)
svmStruct = svmtrain(x_train,y_train,'Kernel_Function','rbf','rbf_sigma',rbf_sigma,'boxconstraint',boxconstraint);
y_fit = svmclassify(svmStruct,x_test);
C = confusionmat(y_test,y_fit);
N = sum(sum(C));
MCE = N - sum(diag(C)); % Number of misclassified samples
end

Best Answer

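I don't know if your code is correct, but accuracy estimates obtained during sequential selection are optimistically biased: the criterion value that sequentialfs reports is the same quantity it used to choose the features.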

Consider, say, 10 random variables, and suppose you wish to find the one with the largest true mean. Suppose these random variables are in fact identically distributed. Generate a separate sample for each variable. Because the samples are finite, their estimated means will not be equal. You then choose the sample with the largest average and conclude that the respective variable has the largest true mean. But all you did was choose the variable whose estimated mean came out largest by chance. Since that estimated mean is the largest of the ten, it is likely above the true mean. Now generate another sample for the chosen variable. Because the true mean is less than the first estimate, your new estimate is likely to be lower than your previous one.
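The thought experiment above can be checked with a small simulation. This is only a sketch under assumed settings: the variable names, the standard normal distribution, the sample size of 30 and the number of trials are all illustrative choices, not anything taken from the question.

```matlab
% Illustrative simulation of selection bias (all names and sizes are assumptions).
rng(0);                 % fix the seed so the run is repeatable
numVars   = 10;         % identically distributed variables, true mean 0
numObs    = 30;         % observations per sample
numTrials = 5000;       % repetitions of the selection experiment

selectedMean = zeros(numTrials,1);   % estimated mean of the "winning" variable
freshMean    = zeros(numTrials,1);   % mean of a new sample for that variable

for t = 1:numTrials
    X = randn(numObs,numVars);           % one sample per variable (columns)
    selectedMean(t) = max(mean(X,1));    % keep the largest estimated mean
    % All variables share one distribution, so a fresh sample for the
    % winner is simply a new draw from that same distribution.
    freshMean(t) = mean(randn(numObs,1));
end

fprintf('Average estimate at selection time: %.3f\n', mean(selectedMean));
fprintf('Average estimate on fresh data:     %.3f\n', mean(freshMean));
```

On average the estimate recorded at selection time should sit clearly above zero, while the fresh estimate should sit near the true mean of zero. The gap between Acc_fs and Acc in the question is the same effect, with feature subsets in place of variables.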

This is exactly why you need to re-estimate the accuracy by another run of cross-validation after selection is done.
