The key thing to remember is that, for cross-validation to give an (almost) unbiased performance estimate, every step involved in fitting the model must also be performed independently in each fold of the cross-validation procedure. The best thing to do is to view feature selection, setting the meta/hyper-parameters and optimising the parameters as integral parts of model fitting, and never do any one of these steps without the other two.
The optimistic bias that can be introduced by departing from that recipe can be surprisingly large, as demonstrated by Cawley and Talbot (2010), where the bias introduced by an apparently benign departure was larger than the difference in performance between competing classifiers. Worse still, biased protocols favour bad models most strongly, as bad models are more sensitive to the tuning of hyper-parameters and hence are more prone to over-fitting the model selection criterion!
Answers to specific questions:
The procedure in step 1 is valid because feature selection is performed separately in each fold, so what you are cross-validating is the whole procedure used to fit the final model. The cross-validation estimate will have a slight pessimistic bias, as the dataset in each fold is slightly smaller than the whole dataset used for the final model.
For 2, as cross-validation is used to select the model parameters, you need to repeat that procedure independently in each fold of the cross-validation used for performance estimation, so you end up with nested cross-validation (see the sketch after these answers).
For 3, essentially yes, you need nested nested cross-validation: in each fold of the outermost cross-validation (used for performance estimation) you need to repeat everything you intend to do to fit the final model.
For 4, yes, if you have a separate hold-out set, then that will give an unbiased estimate of performance without needing an additional cross-validation.
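To make the structure of nested cross-validation concrete, here is a minimal MATLAB sketch (not from the original analysis), reusing the random coin-flip setup of the simulation further down. The tuned hyper-parameter is taken to be the number of selected features, and the "classifier" is a majority vote over the selected features; both choices are illustrative assumptions rather than a recommended method. For point 3, the same pattern simply nests one level deeper.

% minimal sketch of nested cross-validation: the outer loop estimates
% performance, while the inner loop (run on the outer training set only)
% chooses the hyper-parameter k = number of selected features
NF = 56; NC = 259; NFOLD = 10;
ks = [1 3 5 7];                     % candidate values of k (assumed)
y = randn(NC,1) >= 0;               % random binary labels
x = randn(NC,NF) >= 0;              % random binary features
outer = mod(1:NC, NFOLD)' + 1;      % outer partition (performance estimation)
y_xval = zeros(size(y));
for i=1:NFOLD
   ytr = y(outer~=i); xtr = x(outer~=i,:);   % outer training set
   inner = mod(1:numel(ytr), NFOLD)' + 1;    % inner partition (model selection)
   ierr = zeros(size(ks));
   for k=1:numel(ks)
      yk = zeros(size(ytr));
      for j=1:NFOLD
         % feature selection using inner training data only
         e = mean(repmat(ytr(inner~=j),1,NF) ~= xtr(inner~=j,:));
         [~,idx] = sort(e);
         % majority vote over the best ks(k) features
         yk(inner==j) = mean(xtr(inner==j,idx(1:ks(k))),2) >= 0.5;
      end
      ierr(k) = mean(yk ~= ytr);
   end
   [~,best] = min(ierr);                     % choose k by inner cross-validation
   % refit on the whole outer training set using the chosen k
   e = mean(repmat(ytr,1,NF) ~= xtr);
   [~,idx] = sort(e);
   y_xval(outer==i) = mean(x(outer==i,idx(1:ks(best))),2) >= 0.5;
end
fprintf(1, 'nested CV estimate = %f\n', mean(y_xval ~= y));

As the data here are pure noise, the printed estimate should come out close to the true error rate of 0.5, precisely because nothing outside the outer training set ever influences the fitting.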
If you perform feature selection on all of the data and then cross-validate, then the test data in each fold of the cross-validation procedure were also used to choose the features, and this is what biases the performance analysis.
Consider this example. We generate some target data by flipping a coin 10 times and recording whether it comes down heads or tails. Next, we generate 20 features by flipping the coin 10 times for each feature and write down what we get. We then perform feature selection by picking the feature that matches the target data as closely as possible and use that feature as our prediction. If we then cross-validate, we will get an expected error rate lower than 0.5. This is because we have chosen the feature on the basis of a correlation over both the training set and the test set in every fold of the cross-validation procedure. However, the true error rate is going to be 0.5, as the target data are simply random. If you perform feature selection independently within each fold of the cross-validation, the expected value of the error rate is 0.5 (which is correct).
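As a quick sanity check (not part of the original example), the expected apparent error of the feature selected on all of the data can be estimated by simulation, and it comes out well below the true rate of 0.5; a cross-validation whose test data also informed the selection inherits much of that optimism:

% expected apparent error of the best-matching of 20 random features
% over 10 coin flips (the example above), estimated by simulation
nrep = 1e+4;
besterr = zeros(nrep,1);
for r=1:nrep
   t = rand(10,1) >= 0.5;    % target coin flips
   f = rand(10,20) >= 0.5;   % 20 candidate features
   besterr(r) = min(mean(repmat(t,1,20) ~= f));
end
fprintf(1, 'expected apparent error of selected feature = %f\n', mean(besterr));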
The key idea is that cross-validation is a way of estimating the generalisation performance of a process for building a model, so you need to repeat the whole process in each fold. Otherwise, you will end up with a biased estimate, or an under-estimate of the variance of the estimate (or both).
HTH
Here is some MATLAB code that performs a Monte-Carlo simulation of this setup, with 56 features and 259 cases to match your example. The output it gives is:
Biased estimator: erate = 0.429210 (0.397683 - 0.451737)
Unbiased estimator: erate = 0.499689 (0.397683 - 0.590734)
The biased estimator is the one where feature selection is performed prior to cross-validation; the unbiased estimator is the one where feature selection is performed independently in each fold of the cross-validation. This suggests that the bias can be quite severe in this case, depending on the nature of the learning task.
NF = 56;      % number of candidate features
NC = 259;     % number of cases
NFOLD = 10;   % number of cross-validation folds
NMC = 1e+4;   % number of Monte-Carlo replications

% perform Monte-Carlo simulation of the biased estimator
% (feature selection on all of the data, prior to cross-validation)
erate = zeros(NMC,1);
for i=1:NMC
   % random labels and features, so the true error rate is 0.5
   y = randn(NC,1) >= 0;
   x = randn(NC,NF) >= 0;
   % perform feature selection using ALL of the data
   err = mean(repmat(y,1,NF) ~= x);
   [err,idx] = min(err);
   % perform cross-validation (the selected feature itself is the prediction)
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   plot(erate);   % monitor progress
   drawnow;
end
erate = sort(erate);
fprintf(1, ' Biased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));

% perform Monte-Carlo simulation of the unbiased estimator
% (feature selection repeated independently inside each fold)
erate = zeros(NMC,1);
for i=1:NMC
   y = randn(NC,1) >= 0;
   x = randn(NC,NF) >= 0;
   % perform cross-validation
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      % perform feature selection using only the training data for this fold
      err = mean(repmat(y(partition~=j),1,NF) ~= x(partition~=j,:));
      [err,idx] = min(err);
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   plot(erate);   % monitor progress
   drawnow;
end
erate = sort(erate);
fprintf(1, 'Unbiased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
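The figures in parentheses printed by each fprintf are the 2.5th and 97.5th percentiles of the sorted Monte-Carlo distribution of the estimate, so they show the run-to-run variability of each estimator as well as its mean. Note from the output above that the unbiased estimator is centred on the true error rate of 0.5 but has a visibly wider spread, in line with the earlier remark about under-estimating the variance of the estimate.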
Best Answer
There isn't a single "best" answer to your question. There are many threads throughout CV that discuss these issues. Here's one ...
Understanding how good a prediction is, in logistic regression
Key points to note are that training or calibration data provide useful information regarding the quality of fit (e.g., Harrell recommends using a nonparametric loess smoother to assess calibration), but in-sample estimates of performance are well known to be positively (optimistically) biased.
Regardless, the emphasis in machine learning is on using validation (out-of-sample) data to evaluate fit. Validation results provide useful information regarding the bias-variance tradeoff that can't be inferred from calibration information.
If your validation error and training error are both high, you have underfitting (bias).
If your validation error is high but training error is low, you have overfitting (variance).
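To make that diagnostic concrete, here is a minimal base-MATLAB sketch (an illustrative assumption, not taken from the linked threads) comparing training and validation error as model complexity grows, using polynomial fits to noisy data as a stand-in for any model family:

% compare training and validation error as complexity (polynomial degree) grows
n = 100;
xd = linspace(0,1,n)';
yd = sin(2*pi*xd) + 0.3*randn(n,1);   % noisy targets
itr = 1:2:n; ival = 2:2:n;            % simple alternating train/validation split
for d = [1 3 9 15]
   [w,~,mu] = polyfit(xd(itr), yd(itr), d);    % centred and scaled fit
   etr = mean((polyval(w, xd(itr), [], mu) - yd(itr)).^2);
   eva = mean((polyval(w, xd(ival), [], mu) - yd(ival)).^2);
   fprintf(1, 'degree %2d: train MSE = %f, validation MSE = %f\n', d, etr, eva);
end

Low degrees leave both errors high (underfitting); high degrees drive the training error down while the validation error climbs (overfitting).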
Here's one discussion of this:
Question about bias-variance tradeoff