If you perform feature selection on all of the data and then cross-validate, the test data in each fold of the cross-validation procedure were also used to choose the features, and that is what biases the performance estimate.
Consider this example. We generate some target data by flipping a coin 10 times and recording whether it comes up heads or tails. Next, we generate 20 features by flipping the coin 10 times for each feature and writing down what we get. We then perform feature selection by picking the feature that matches the target data as closely as possible, and use that feature as our prediction. If we then cross-validate, we will get an expected error rate lower than 0.5. This is because we have chosen the feature on the basis of a correlation over both the training set and the test set in every fold of the cross-validation procedure. However, the true error rate is going to be 0.5, as the target data are simply random. If instead you perform feature selection independently within each fold of the cross-validation, the expected value of the error rate is 0.5 (which is correct).
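This toy experiment is easy to reproduce. Here is a minimal Python/NumPy sketch (the sample sizes come from the paragraph above; the number of Monte-Carlo replications and the fold count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_feats, n_folds, n_mc = 10, 20, 5, 2000

biased = np.empty(n_mc)
unbiased = np.empty(n_mc)
fold = np.arange(n_cases) % n_folds  # fixed fold assignment

for i in range(n_mc):
    y = rng.integers(0, 2, n_cases)             # 10 coin flips (target)
    x = rng.integers(0, 2, (n_cases, n_feats))  # 20 random "features"

    # Biased: select the best-matching feature on ALL the data.  Since
    # the prediction is just the feature itself, its cross-validated
    # error equals its full-sample error.
    best = np.argmin((x != y[:, None]).mean(axis=0))
    biased[i] = (x[:, best] != y).mean()

    # Unbiased: repeat the selection inside each fold, using only the
    # training cases, then predict on the held-out cases.
    pred = np.empty(n_cases, dtype=int)
    for j in range(n_folds):
        train, test = fold != j, fold == j
        best_j = np.argmin((x[train] != y[train][:, None]).mean(axis=0))
        pred[test] = x[test, best_j]
    unbiased[i] = (pred != y).mean()

print(f"biased:   {biased.mean():.3f}")   # well below 0.5
print(f"unbiased: {unbiased.mean():.3f}") # close to 0.5
```

At this small scale (10 cases, 20 features) the optimism of the biased estimate is dramatic, which makes the mechanism easy to see before moving to the larger simulation below.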
The key idea is that cross-validation is a way of estimating the generalization performance of a process for building a model, so you need to repeat the whole process in each fold. Otherwise, you will end up with a biased estimate, or an under-estimate of the variance of the estimate (or both).
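In practice, "repeat the whole process in each fold" usually means wrapping every data-dependent step in a single fitted object. As an illustrative sketch (scikit-learn is not part of the question; it is just one convenient way to show the idea, on synthetic pure-noise data):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))  # pure-noise features
y = rng.integers(0, 2, 100)     # random binary target

# Putting feature selection inside the pipeline means it is refit on
# the training portion of every fold, never on the held-out fold.
model = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean())  # close to chance (0.5), as it should be on noise
```

Running `SelectKBest` on all of `X` first and then cross-validating only the classifier would reproduce the biased estimate described above.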
HTH
Here is some MATLAB code that performs a Monte-Carlo simulation of this setup, with 56 features and 259 cases to match your example; the output it gives is:
Biased estimator: erate = 0.429210 (0.397683 - 0.451737)
Unbiased estimator: erate = 0.499689 (0.397683 - 0.590734)
The biased estimator is the one where feature selection is performed prior to cross-validation; the unbiased estimator is the one where feature selection is performed independently within each fold of the cross-validation. This suggests that the bias can be quite severe, depending on the nature of the learning task.
NF    = 56;   % number of candidate features
NC    = 259;  % number of cases
NFOLD = 10;   % number of cross-validation folds
NMC   = 1e+4; % number of Monte-Carlo replications

% perform Monte-Carlo simulation of biased estimator
% (feature selection performed once, on all of the data)
erate = zeros(NMC,1);
for i=1:NMC
   y = randn(NC,1)  >= 0;   % random binary target
   x = randn(NC,NF) >= 0;   % random binary features
   % perform feature selection: pick the feature closest to the target
   err = mean(repmat(y,1,NF) ~= x);
   [err,idx] = min(err);
   % perform cross-validation using the pre-selected feature
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   plot(erate);
   drawnow;
end
erate = sort(erate);
fprintf(1, '  Biased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));

% perform Monte-Carlo simulation of unbiased estimator
% (feature selection repeated on the training data within each fold)
erate = zeros(NMC,1);
for i=1:NMC
   y = randn(NC,1)  >= 0;
   x = randn(NC,NF) >= 0;
   % perform cross-validation
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      % perform feature selection using only this fold's training data
      err = mean(repmat(y(partition~=j),1,NF) ~= x(partition~=j,:));
      [err,idx] = min(err);
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   plot(erate);
   drawnow;
end
erate = sort(erate);
fprintf(1, 'Unbiased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
Best Answer
To answer your main question: it would be optimal, and more appropriate, to scale within the CV. But it will probably not matter much, and might not be important in practice at all, if your classifier rescales the data internally, as most do (at least in R).
However, selecting features before cross-validating is a BIG NO and will lead to overfitting, since you will select them based on how they perform on the whole data set. The log-transformation is fine to perform outside the CV, since the transformation does not depend on the actual data (only on the type of data): it is not something you would do differently if you had only 90% of the data instead of 100%, and it is not tweaked according to the data.
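This split between data-independent and data-dependent preprocessing can be sketched as follows (again using scikit-learn purely for illustration, on synthetic data):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.exp(rng.normal(size=(90, 30)))  # positive, skewed synthetic data
y = rng.integers(0, 2, 90)

# Data-independent: the log-transform gives the same result whether it
# sees 90% or 100% of the data, so it can be applied once, up front.
X = np.log(X)

# Data-dependent: scaling and feature selection are estimated from the
# data, so they go inside the pipeline and are refit on each fold's
# training portion only.
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", SVC()),
])
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean())
```

The rule of thumb: anything whose parameters are estimated from the data belongs inside the cross-validated pipeline; fixed transformations do not have to be.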
To also answer your comment: obviously, whether it will result in overfitting depends on your manner of feature selection. If you choose features by chance (why would you do that?) or on the basis of a priori theoretical considerations (other literature), it won't matter. But if the selection depends on your data set, it will. The Elements of Statistical Learning has a good explanation. You can freely and legally download a .pdf here: http://www-stat.stanford.edu/~tibs/ElemStatLearn/
The part relevant to your question is Section 7.10.2, on page 245 of the fifth printing, titled "The Wrong and Right Way to Do Cross-validation".