Solved – Forward or backward sequential feature selection

feature selection, machine learning, MATLAB, stepwise regression

I am trying to carry out feature selection on a dataset using sequential feature selection (sequentialfs in MATLAB). The dataset contains more than 5000 observations (rows) and 22 features (columns). The function offers two search directions: 'forward' and 'backward'. I have been reading the article 'An Introduction to Variable and Feature Selection' (Guyon & Elisseeff, 2003), which mentions that both techniques yield nested subsets of variables.

When I run forward selection with the code below:

%% sequentialfs (forward) and knn
rng(100)                             % fix the seed for reproducibility
c = cvpartition(groups_cv,'k',10);   % 10-fold cross-validation partition
opts = statset('display','iter');    % show progress at each step
% Criterion: number of test-fold misclassifications of a 5-NN classifier
fun = @(xtrain,ytrain,xtest,ytest) sum(ytest ~= predict(fitcknn(xtrain,ytrain,'NumNeighbors',5),xtest));
[fs,history] = sequentialfs(fun,data_cv_nor,groups_cv,'cv',c,'options',opts,'direction','forward');

I get the following set of features.

Final columns included: 8 9 11 12 14 17 19
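
For reference, fs comes back as a logical row vector and history records each step of the search, so the selected columns and the criterion trace can be read off like this:

find(fs)        % indices of the selected columns, i.e. 8 9 11 12 14 17 19 here
history.Crit    % criterion value after each step of the search
history.In      % logical matrix: row i = features included at step i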

And when I run backward selection with the code below:

%% sequentialfs (backward) and knn
% Same setup as above, but the search starts from all 22 features and removes one at a time
rng(100)
c = cvpartition(groups_cv,'k',10);
opts = statset('display','iter');
fun = @(xtrain,ytrain,xtest,ytest) sum(ytest ~= predict(fitcknn(xtrain,ytrain,'NumNeighbors',5),xtest));
[fs,history] = sequentialfs(fun,data_cv_nor,groups_cv,'cv',c,'options',opts,'direction','backward');

I get the following set of features.

Final columns included: 2 3 6 7 8 11 12 14 16 17 18 19 21 22

Clearly, the backward search has retained twice as many features as the forward search, and the two techniques do not agree on a single subset.

My questions are:

  1. Which of these two techniques is preferable, and how do I choose between them?

  2. If I change the seed, I get a different set of features. How should I deal with this? Is something like 100 repetitions of 10-fold cross-validation preferable here, i.e., running the same code 100 times with different seeds and keeping the features that are selected most often? (A sketch of what I mean follows this list.)
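
To make question 2 concrete, here is a minimal sketch of what I have in mind. The number of repetitions (100) and the inclusion threshold (half of the runs) are arbitrary choices on my part; data_cv_nor and groups_cv are my variables from above.

%% Selection frequency over repeated runs with different seeds
nReps = 100;                             % arbitrary number of repetitions
selCount = zeros(1,size(data_cv_nor,2)); % how often each of the 22 features is picked
fun = @(xtrain,ytrain,xtest,ytest) sum(ytest ~= predict(fitcknn(xtrain,ytrain,'NumNeighbors',5),xtest));
for r = 1:nReps
    rng(r)                               % a different seed for every repetition
    c = cvpartition(groups_cv,'k',10);
    fs = sequentialfs(fun,data_cv_nor,groups_cv,'cv',c,'direction','forward');
    selCount = selCount + fs;            % fs is a logical row vector
end
stableFeatures = find(selCount >= nReps/2) % features kept in at least half of the runs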

The above question is based on one of the reviewer comments for my research article which mentions

"The major limitation I see is that authors did not try any of the state-of-the-art feature engineering algorithms for the selection of variables, prior to the application of the four selected algorithms (LR, RF, BoostTree, SVM). While they did a statistical testing of the features, I think that being able to have similar performances with many less variables could be specially interesting in this setting. Most of the classifier-oriented research do this type of preprocessing and compare the runs with all variables and some subsets of variables as selected by different algorithms. For matlab authors can have a initial look at:
https://www.mathworks.com/discovery/feature-selection.html
"

Best Answer

The fact that you get different answers from forward and backward selection, and different answers again when you change the seed, should give you pause. Clearly, these subsets can't all be right; most likely, none of them are. The simplest answer is that you should not use these methods at all. There are many existing threads on why stepwise selection is unreliable that you might want to read.

In place of these methods, you might stop and ask why you need to select variables at all. Twenty-two variables with 5,000 observations should present no real problem for most purposes. One quick check is to compare the cross-validated error of the full model against that of a selected subset, as sketched below.
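
As a rough illustration, reusing your variable names (data_cv_nor, groups_cv, and a logical selection vector fs from one of your runs; the seed and fold count are arbitrary):

% Compare 10-fold cross-validated error: all 22 features vs. a selected subset
rng(1)                                           % arbitrary seed
mdlFull = fitcknn(data_cv_nor, groups_cv, 'NumNeighbors', 5);
mdlSub  = fitcknn(data_cv_nor(:,fs), groups_cv, 'NumNeighbors', 5);
errFull = kfoldLoss(crossval(mdlFull,'KFold',10));
errSub  = kfoldLoss(crossval(mdlSub, 'KFold',10));
fprintf('CV error, all features:      %.4f\n', errFull)
fprintf('CV error, selected features: %.4f\n', errSub)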
