If you perform feature selection on all of the data and then cross-validate, the test data in each fold of the cross-validation procedure were also used to choose the features, and that is what biases the performance analysis.
Consider this example. We generate some target data by flipping a coin 10 times and recording whether it comes down heads or tails. Next, we generate 20 features by flipping the coin 10 times for each feature and recording what we get. We then perform feature selection by picking the feature that matches the target data as closely as possible and use that feature as our prediction. If we then cross-validate, we will get an expected error rate slightly lower than 0.5, because we have chosen the feature on the basis of a correlation over both the training set and the test set in every fold of the cross-validation procedure. However, the true error rate is going to be 0.5, as the target data are simply random. If instead you perform feature selection independently within each fold of the cross-validation, the expected value of the error rate is 0.5 (which is correct).
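For instance, a single run of this experiment (a minimal MATLAB sketch; the variable names are just for illustration) shows how deceptively predictive the best-matching feature can look:
y = rand(10,1) >= 0.5;             % target: 10 coin flips
x = rand(10,20) >= 0.5;            % 20 candidate features, 10 flips each
err = mean(repmat(y,1,20) ~= x);   % error rate of each feature against the target
[err,idx] = min(err);              % pick the feature that matches most closely
fprintf(1, 'apparent error rate = %f\n', err);   % typically well below 0.5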
The key idea is that cross-validation is a way of estimating the generalization performance of a process for building a model, so you need to repeat the whole process, including feature selection, in each fold. Otherwise you will end up with a biased estimate, an under-estimate of the variance of the estimate, or both.
Here is some MATLAB code that performs a Monte-Carlo simulation of this setup, with 56 features and 259 cases to match your example. The output it gives is:
Biased estimator: erate = 0.429210 (0.397683 - 0.451737)
Unbiased estimator: erate = 0.499689 (0.397683 - 0.590734)
The biased estimator is the one where feature selection is performed prior to cross-validation; the unbiased estimator is the one where feature selection is performed independently in each fold of the cross-validation. This suggests that the bias can be quite severe, although it will depend on the nature of the learning task.
NF = 56;       % number of candidate features
NC = 259;      % number of cases
NFOLD = 10;    % number of cross-validation folds
NMC = 1e+4;    % number of Monte-Carlo replications
% perform Monte-Carlo simulation of biased estimator
erate = zeros(NMC,1);
for i=1:NMC
   y = randn(NC,1) >= 0;              % random binary target
   x = randn(NC,NF) >= 0;             % random binary features, independent of the target
   % perform feature selection on ALL of the data
   err = mean(repmat(y,1,NF) ~= x);
   [err,idx] = min(err);
   % perform cross-validation
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   plot(erate);                       % monitor progress of the simulation
   drawnow;
end
erate = sort(erate);
fprintf(1, ' Biased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
% perform Monte-Carlo simulation of unbiased estimator
erate = zeros(NMC,1);
for i=1:NMC
   y = randn(NC,1) >= 0;              % random binary target
   x = randn(NC,NF) >= 0;             % random binary features, independent of the target
   % perform cross-validation
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      % perform feature selection using only the training folds
      err = mean(repmat(y(partition~=j),1,NF) ~= x(partition~=j,:));
      [err,idx] = min(err);
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   plot(erate);                       % monitor progress of the simulation
   drawnow;
end
erate = sort(erate);
fprintf(1, 'Unbiased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
The in-sample estimate of your model's predictive performance is generally overly optimistic: when you test a model on the same data on which it was trained, it appears to fit very well. You get a more accurate estimate of your model's true predictive ability if you do cross-validation, either leave-one-out or k-fold. K-fold is much quicker than leave-one-out, so that's what I use on larger samples.
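In skeleton form, k-fold cross-validation looks like this (a sketch using the same partitioning trick as the simulation above, given a feature matrix x and target y; train_and_predict is a hypothetical stand-in for whatever model fitting you use):
NFOLD = 10;
partition = mod(1:numel(y), NFOLD) + 1;   % assign each case to one of NFOLD folds
y_xval = zeros(size(y));
for j=1:NFOLD
   train = (partition ~= j);              % fit on all folds except the j-th...
   test  = (partition == j);              % ...and predict the held-out fold
   y_xval(test) = train_and_predict(x(train,:), y(train), x(test,:));
end
erate = mean(y_xval ~= y);                % cross-validated error estimate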
By adding extra variables to the model, you always fit the sample data better. For example, in a linear regression, $R^{2}$ always increases when you add another variable to the model. That doesn't mean that adding the extra variable was a good idea. After adding too many variables, you end up overfitting the sample and do a poor job of predicting new data. So the main purpose of cross-validation is to prevent overfitting the model and to ensure that you do the best possible job of predicting on an unseen data set.
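To see this concretely, here is a minimal MATLAB sketch (pure noise, nothing to do with any real data) in which $R^{2}$ keeps climbing as irrelevant regressors are added:
n = 50;
y = randn(n,1);                    % response: pure noise
X = ones(n,1);                     % start from an intercept-only model
for k=1:10
   X = [X, randn(n,1)];            % add a completely irrelevant regressor
   b = X \ y;                      % least-squares fit
   r2 = 1 - sum((y - X*b).^2)/sum((y - mean(y)).^2);
   fprintf(1, '%2d noise regressors: R^2 = %f\n', k, r2);   % never decreases
end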
The procedure you are using will result in optimistically biased performance estimates, because the test data used in steps 2 and 3 were also used in step 1 to decide which features to use. Repeating the exercise reduces the variance of the performance estimate, not the bias, so the bias will not average out. To get an unbiased performance estimate, the test data must not be used in any way to make choices about the model, including feature selection.
A better approach is to use nested cross-validation, so that the outer cross-validation provides an estimate of the performance obtainable by the whole method of constructing the model (including feature selection), while the inner cross-validation is used to select the features independently within each fold of the outer cross-validation. Then build your final predictive model using all of the data.
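In outline, reusing the notation of the simulation above (a sketch only; select_features and train_and_predict are hypothetical placeholders for your feature selection and modelling steps):
outer = mod(1:NC, NFOLD) + 1;              % outer cross-validation folds
y_xval = zeros(NC,1);
for j=1:NFOLD
   train = (outer ~= j);
   test  = (outer == j);
   % the inner cross-validation sees only the outer training data,
   % so the outer test fold plays no part in feature selection
   idx = select_features(x(train,:), y(train));   % e.g. via an inner CV loop
   y_xval(test) = train_and_predict(x(train,idx), y(train), x(test,idx));
end
erate = mean(y_xval ~= y);                 % (almost) unbiased performance estimate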
As you have more features than cases, you are very likely to over-fit the data simply by performing feature selection. It is a bit of a myth that feature selection should be expected to improve predictive performance, so if predictive performance is what you are interested in (rather than identifying the relevant features as an end in itself), you are probably better off using ridge regression without any feature selection. Ridge regression will probably give better predictive performance than feature selection, provided the ridge parameter is selected carefully (I use minimisation of Allen's PRESS statistic, i.e. the leave-one-out estimate of the mean-squared error).
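For ridge regression, the leave-one-out residuals can be computed in closed form from the hat matrix, so PRESS is cheap to minimise over a grid of ridge parameters (a minimal sketch, assuming a design matrix X and response y that have already been centred):
lambdas = 10.^(-6:0.5:2);                  % candidate ridge parameters
press = zeros(size(lambdas));
for k=1:numel(lambdas)
   H = X*((X'*X + lambdas(k)*eye(size(X,2)))\X');   % ridge hat matrix
   r = y - H*y;                            % ordinary residuals
   press(k) = sum((r./(1 - diag(H))).^2);  % closed-form leave-one-out SSE
end
[p_min,best] = min(press);
lambda = lambdas(best);                    % ridge parameter minimising PRESS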
For further details, see Ambroise and McLachlan (2002), "Selection bias in gene extraction on the basis of microarray gene-expression data", PNAS 99(10):6562-6566, and my answer to this question.