If you perform feature selection on all of the data and then cross-validate, the test data in each fold of the cross-validation procedure was also used to choose the features, and this is what biases the performance analysis.
Consider this example. We generate some target data by flipping a coin 10 times and recording whether it comes down as heads or tails. Next, we generate 20 features by flipping the coin 10 times for each feature and write down what we get. We then perform feature selection by picking the feature that matches the target data as closely as possible and use that as our prediction. If we then cross-validate, we will get an expected error rate slightly lower than 0.5. This is because we have chosen the feature on the basis of a correlation over both the training set and the test set in every fold of the cross-validation procedure. However, the true error rate is going to be 0.5 as the target data is simply random. If you perform feature selection independently within each fold of the cross-validation, the expected value of the error rate is 0.5 (which is correct).
The key idea is that cross-validation is a way of estimating the generalization performance of a process for building a model, so you need to repeat the whole process in each fold. Otherwise, you will end up with a biased estimate, or an under-estimate of the variance of the estimate (or both).
HTH
Here is some MATLAB code that performs a Monte-Carlo simulation of this setup, with 56 features and 259 cases to match your example. The output it gives is:
Biased estimator: erate = 0.429210 (0.397683 - 0.451737)
Unbiased estimator: erate = 0.499689 (0.397683 - 0.590734)
The biased estimator is the one where feature selection is performed prior to cross-validation; the unbiased estimator is the one where feature selection is performed independently in each fold of the cross-validation. This suggests that the bias can be quite severe in this case, depending on the nature of the learning task.
NF    = 56;     % number of candidate features
NC    = 259;    % number of cases
NFOLD = 10;     % number of cross-validation folds
NMC   = 1e+4;   % number of Monte-Carlo replications

% perform Monte-Carlo simulation of biased estimator
erate = zeros(NMC,1);
for i=1:NMC
   % generate a random binary target and random binary features
   y = randn(NC,1) >= 0;
   x = randn(NC,NF) >= 0;
   % perform feature selection using ALL of the data (including what will
   % later be used as test data in each fold)
   err = mean(repmat(y,1,NF) ~= x);
   [err,idx] = min(err);
   % perform cross-validation
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   % progress plot of the error-rate estimates so far
   plot(erate);
   drawnow;
end
erate = sort(erate);
fprintf(1, ' Biased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));

% perform Monte-Carlo simulation of unbiased estimator
erate = zeros(NMC,1);
for i=1:NMC
   % generate a random binary target and random binary features
   y = randn(NC,1) >= 0;
   x = randn(NC,NF) >= 0;
   % perform cross-validation
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      % perform feature selection using only the training data for this fold
      err = mean(repmat(y(partition~=j),1,NF) ~= x(partition~=j,:));
      [err,idx] = min(err);
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   % progress plot of the error-rate estimates so far
   plot(erate);
   drawnow;
end
erate = sort(erate);
fprintf(1, 'Unbiased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
I see 3 potential problems with this approach. First, if you intend to use your model for classifying new cases, your variable-selection procedure might lead to a choice of variables too closely linked to peculiarities of this initial data set. Second, the training/test set approach might not be making the most efficient use of the data you have. Third, you might want to reconsider your metric for evaluating models.
First, variable selection tends to find variables that work well for a particular data set but don't generalize well. It's fascinating and frightening to take a variable selection scheme (best subset as you have done, or even LASSO) and see how much the set of selected variables differs just among bootstrap re-samples from the same data set, particularly when many predictors are inter-correlated.
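To see this effect in a toy setting (not your data), here is a minimal MATLAB sketch in the spirit of the simulation above: it picks the single best-matching feature on repeated bootstrap resamples of one synthetic data set and counts how many different features get chosen. All sizes and variable names here are made-up placeholders.

% illustration only: stability of "pick the best single feature" under
% bootstrap resampling of one synthetic data set (placeholder sizes)
NF = 56; NC = 259; NBOOT = 200;
y = randn(NC,1) >= 0;                     % random binary target
x = randn(NC,NF) >= 0;                    % random binary features
chosen = zeros(NBOOT,1);
for b = 1:NBOOT
   idx = randi(NC, NC, 1);                % bootstrap resample with replacement
   err = mean(repmat(y(idx),1,NF) ~= x(idx,:));
   [~, chosen(b)] = min(err);             % feature selected on this resample
end
fprintf(1, 'distinct features selected over %d resamples: %d\n', NBOOT, numel(unique(chosen)));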
For this application, where many of your predictors seem to be correlated, you might be better off taking an approach like ridge regression that treats correlated predictors together. Some initial pruning of your 766 features might still be wise (maybe better based on subject-matter knowledge than on automated selection), or you could consider an elastic net hybrid of LASSO with ridge regression to get down to a reasonable number of predictors. But when you restrict yourself to a handful of predictors you risk throwing out useful information from other potential predictors in future applications.
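If it helps to see the mechanics, here is a minimal MATLAB sketch of ridge regression via its closed-form solution (linear ridge on standardized predictors; for a binary outcome you would use penalized logistic regression instead, but the shrinkage idea is the same). The data, names, and penalty value are placeholders, not recommendations.

% minimal ridge regression sketch: beta = (X'X + lambda*I)^(-1) X'y
n = 100; p = 20;
X = randn(n, p);                          % placeholder predictors
X(:,2) = X(:,1) + 0.1*randn(n,1);         % make two predictors strongly correlated
y = X(:,1) + randn(n,1);                  % placeholder response
Xs = (X - mean(X)) ./ std(X);             % standardize so the penalty treats features equally
yc = y - mean(y);                         % center the response
lambda = 10;                              % placeholder penalty (choose by cross-validation)
beta = (Xs'*Xs + lambda*eye(p)) \ (Xs'*yc);
disp(beta(1:2)');                         % correlated predictors get similar, shrunken weights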
Second, you may be better off using the entire data set to build the model and then using bootstrapping to estimate its generalizability. For example, you could use cross-validation on the entire data set to find the best choice of penalty for ridge regression, then apply that choice to the entire data set. You would then test the quality of your model on bootstrap samples of the data set. That approach tends to maximize the information that you extract from the data, while still documenting its potential future usefulness.
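Here is a rough MATLAB sketch of that workflow under the same placeholder linear-ridge setup as above: pick the penalty by 10-fold cross-validation on all of the data, refit on all of the data, and then gauge generalizability by refitting on bootstrap resamples and scoring the out-of-bag cases. It is only an outline of the idea, not the exact procedure you would use for a binary outcome.

% placeholder data, standardized as in the previous sketch
n = 100; p = 20;
X = randn(n, p); y = X(:,1) + randn(n,1);
Xs = (X - mean(X)) ./ std(X); yc = y - mean(y);

% choose the ridge penalty by 10-fold cross-validation on the full data set
K = 10;
fold = mod(1:n, K) + 1;
lambdas = 10.^(-2:0.5:3);
cv_mse = zeros(size(lambdas));
for l = 1:numel(lambdas)
   resid = zeros(n,1);
   for k = 1:K
      tr = fold ~= k; te = fold == k;
      b = (Xs(tr,:)'*Xs(tr,:) + lambdas(l)*eye(p)) \ (Xs(tr,:)'*yc(tr));
      resid(te) = yc(te) - Xs(te,:)*b;
   end
   cv_mse(l) = mean(resid.^2);
end
[~, best] = min(cv_mse);

% refit on the entire data set with the chosen penalty
beta = (Xs'*Xs + lambdas(best)*eye(p)) \ (Xs'*yc);

% bootstrap check: refit on each resample, score the out-of-bag cases
NBOOT = 200;
oob_mse = zeros(NBOOT,1);
for i = 1:NBOOT
   idx = randi(n, n, 1);
   oob = setdiff(1:n, idx);
   b = (Xs(idx,:)'*Xs(idx,:) + lambdas(best)*eye(p)) \ (Xs(idx,:)'*yc(idx));
   oob_mse(i) = mean((yc(oob) - Xs(oob,:)*b).^2);
end
fprintf(1, 'chosen lambda = %g, out-of-bag MSE = %f\n', lambdas(best), mean(oob_mse));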
Third, your focus on classification accuracy makes the hidden assumption that both types of classification errors have the same cost and that both types of classification successes have the same benefit. If you have thought hard about this issue and that is your expert opinion, OK. Otherwise, you might consider a different metric for, say, choosing the ridge-regression penalty during cross-validation. Deviance might be a more generally useful metric, so that you get the best estimates of predicted probabilities and then can later consider the cost-benefit tradeoffs in the ultimate classification scheme.
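For concreteness, binomial deviance is just minus twice the log-likelihood of the observed 0/1 outcomes under the predicted probabilities; a tiny MATLAB sketch with made-up numbers:

% binomial deviance from predicted class-1 probabilities (smaller is better)
y    = [1 0 1 1 0]';              % placeholder observed 0/1 outcomes
phat = [0.9 0.2 0.6 0.7 0.4]';    % placeholder predicted probabilities
eps0 = 1e-12;                     % guard against log(0)
deviance = -2 * sum(y.*log(phat+eps0) + (1-y).*log(1-phat+eps0));
fprintf(1, 'deviance = %f\n', deviance);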
In terms of avoiding overfitting, the penalty in ridge regression means that the effective number of variables in the model can be many fewer than the number of variables nominally included. With only 42 cases of the least-common class, you were right to end up with only 3 features (about 15 such cases per selected feature). The penalization provided by ridge regression, if chosen well by cross-validation, will allow you to combine information from more features in a way that is less dependent on the peculiarities of your present data set, while avoiding overfitting and remaining generalizable to new cases.
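As a sketch of what "effective number of variables" means here: for linear ridge it is the trace of the hat matrix, trace(X*(X'X + lambda*I)^(-1)*X'), which shrinks from p towards 0 as the penalty grows. Placeholder data again:

% effective degrees of freedom of a linear ridge fit: trace of the hat matrix
n = 100; p = 20;
X = randn(n, p);
Xs = (X - mean(X)) ./ std(X);
for lambda = [0 1 10 100]
   H = Xs * ((Xs'*Xs + lambda*eye(p)) \ Xs');
   fprintf(1, 'lambda = %6.1f  effective df = %5.2f\n', lambda, trace(H));
end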
I have had a similar experience to yours with real-life data. Boruta does not give you any guarantees; you should treat its output as a "suggestion" rather than a definitive answer.
This is even discussed by Kursa and Rudnicki (2010) in their paper about Boruta.
You could also try other methods, e.g. entropy-based ones (see the FSelectorRcpp project).
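To illustrate what an entropy-based score does (FSelectorRcpp is an R package, so this is just a MATLAB sketch of the underlying idea with made-up binary data): rank each feature by its information gain with respect to the target.

% sketch of an entropy-based feature score (information gain) for binary
% features x (n-by-p) and a binary target y; illustration only
n = 200; p = 10;
x = rand(n,p) > 0.5;                          % placeholder binary features
y = double(x(:,3) & (rand(n,1) > 0.2));       % target loosely tied to feature 3
H = @(q) -sum(q(q>0) .* log2(q(q>0)));        % entropy of a probability vector
Hy = H([mean(y), 1-mean(y)]);
gain = zeros(1,p);
for j = 1:p
   p1 = mean(x(:,j));                         % P(x_j = 1)
   H1 = H([mean(y(x(:,j)==1)), 1-mean(y(x(:,j)==1))]);
   H0 = H([mean(y(x(:,j)==0)), 1-mean(y(x(:,j)==0))]);
   gain(j) = Hy - (p1*H1 + (1-p1)*H0);        % information gain of feature j
end
[~, order] = sort(gain, 'descend');
disp(order(1:3));                             % top three features by information gain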
Feature selection algorithms are far from perfect. Marcin Kosiński compared the performance of three different methods and got a different solution from each (comparison figure at r-addict.com).
Kursa, M.B., & Rudnicki, W.R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11), 1-12.