Feature Selection and Cross-Validation – Techniques and Best Practices

Tags: cross-validation, feature-selection

I have recently been reading a lot on this site (@Aniko, @Dikran Marsupial, @Erik) and elsewhere about the problem of overfitting occurring with cross-validation (Smialowski et al. 2010, Bioinformatics; Hastie et al., The Elements of Statistical Learning).
The suggestion is that any supervised feature selection (e.g. using correlation with the class labels) performed outside of the model-performance estimation by cross-validation (or another performance-estimation method such as bootstrapping) may result in overfitting.

This seems counterintuitive to me: surely if you select a feature set and then evaluate your model on only the selected features using cross-validation, you are getting an unbiased estimate of generalized model performance on those features (this assumes the sample under study is representative of the population)?

With this procedure one cannot, of course, claim an optimal feature set, but can one report the performance of the selected feature set on unseen data as valid?

I accept that selecting features based on the entire data set may result in some data leakage between the test and training sets. But if the feature set is static after the initial selection, and no other tuning is being done, surely it is valid to report the cross-validated performance metrics?
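
To be concrete, the procedure I have in mind looks roughly like this (an illustrative sketch only, not my exact code: the correlation-based ranking follows what I described above, but keeping the top 10 features and using an LDA classifier are arbitrary placeholder choices, and it assumes the Statistics and Machine Learning Toolbox with binary 0/1 labels):

% rank features by absolute correlation with the class labels, using ALL of the data
rho      = abs(corr(X, double(y)));
[~, ord] = sort(rho, 'descend');
keep     = ord(1:10);               % feature set is fixed from this point on

% cross-validate a classifier on the pre-selected feature subset
mdl = fitcdiscr(X(:,keep), y, 'CrossVal', 'on', 'KFold', 10);
fprintf('CV error on the pre-selected features: %f\n', kfoldLoss(mdl));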

In my case I have 56 features and 259 cases and so #cases > #features. The features are derived from sensor data.

Apologies if my question seems derivative but this seems an important point to clarify.

Edit:
On implementing feature selection within cross-validation on the data set detailed above (thanks to the answers below), I can confirm that selecting features prior to cross-validation introduced a significant bias in this data set. This bias/overfitting was greatest for a 3-class formulation, compared with a 2-class formulation.
I think the fact that I used stepwise regression for feature selection increased this overfitting; for comparison, on a different but related data set I compared a sequential forward feature selection routine performed prior to cross-validation against results I had previously obtained with feature selection inside the CV. The results of the two methods did not differ dramatically. This may mean that stepwise regression is more prone to overfitting than sequential FS, or it may be a quirk of this data set.
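
For reference, the nested arrangement ("feature selection within CV") looks roughly like this; this is only a sketch, using sequentialfs, cvpartition, fitcdiscr and an LDA classifier from the Statistics and Machine Learning Toolbox as one possible instantiation rather than the exact routines I used:

% criterion for sequentialfs: number of misclassifications of an LDA model
critfun = @(Xtr, ytr, Xte, yte) sum(yte ~= predict(fitcdiscr(Xtr, ytr), Xte));

outer = cvpartition(y, 'KFold', 10);
yhat  = zeros(size(y));

for k = 1:outer.NumTestSets

   tr = training(outer, k);
   te = test(outer, k);

   % feature selection uses ONLY the training portion of this fold
   selected = sequentialfs(critfun, X(tr,:), y(tr));

   % fit on the selected features and predict the held-out fold
   mdl      = fitcdiscr(X(tr, selected), y(tr));
   yhat(te) = predict(mdl, X(te, selected));

end

fprintf('Nested-CV error rate: %f\n', mean(yhat ~= y));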

Best Answer

If you perform feature selection on all of the data and then cross-validate, then the test data in each fold of the cross-validation procedure was also used to choose the features, and this is what biases the performance analysis.

Consider this example. We generate some target data by flipping a coin 10 times and recording whether it comes down as heads or tails. Next, we generate 20 features by flipping the coin 10 times for each feature and write down what we get. We then perform feature selection by picking the feature that matches the target data as closely as possible and use that as our prediction. If we then cross-validate, we will get an expected error rate slightly lower than 0.5. This is because we have chosen the feature on the basis of a correlation over both the training set and the test set in every fold of the cross-validation procedure. However, the true error rate is going to be 0.5 as the target data is simply random. If you perform feature selection independently within each fold of the cross-validation, the expected value of the error rate is 0.5 (which is correct).
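
To see the selection effect in isolation, here is a minimal sketch of the coin-flipping example (simpler than the full simulation further down: it only measures the apparent error of the chosen feature on the same ten flips used to choose it, rather than cross-validating):

nrep     = 1e+4;
apparent = zeros(nrep,1);

for r = 1:nrep

   y = rand(10,1)  >= 0.5;            % target: ten coin flips
   x = rand(10,20) >= 0.5;            % twenty random binary features

   err         = mean(repmat(y,1,20) ~= x);   % error of each feature against the target
   apparent(r) = min(err);                    % best-matching feature looks better than chance

end

fprintf('Mean apparent error of the selected feature: %f (the true error rate is 0.5)\n', mean(apparent));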

The key idea is that cross-validation is a way of estimating the generalization performance of a process for building a model, so you need to repeat the whole process in each fold. Otherwise, you will end up with a biased estimate, or an under-estimate of the variance of the estimate (or both).

HTH

Here is some MATLAB code that performs a Monte-Carlo simulation of this setup, with 56 features and 259 cases to match your example. The output it gives is:

Biased estimator: erate = 0.429210 (0.397683 - 0.451737)

Unbiased estimator: erate = 0.499689 (0.397683 - 0.590734)

The biased estimator is the one where feature selection is performed prior to cross-validation; the unbiased estimator is the one where feature selection is performed independently within each fold of the cross-validation. This suggests that the bias can be quite severe in this case, depending on the nature of the learning task.

NF    = 56;
NC    = 259;
NFOLD = 10;
NMC   = 1e+4;

% perform Monte-Carlo simulation of biased estimator

erate = zeros(NMC,1);

for i=1:NMC

   y = randn(NC,1)  >= 0;   % random binary target (true error rate is 0.5)
   x = randn(NC,NF) >= 0;   % random binary features, unrelated to the target

   % perform feature selection on ALL of the data (training and test cases together)

   err       = mean(repmat(y,1,NF) ~= x);
   [err,idx] = min(err);

   % perform cross-validation

   % assign each case to one of NFOLD cross-validation folds
   partition = mod(1:NC, NFOLD)+1;
   y_xval    = zeros(size(y));

   for j=1:NFOLD

      % the 'model' is simply the pre-selected feature itself
      y_xval(partition==j) = x(partition==j,idx(1));

   end

   erate(i) = mean(y_xval ~= y);

   % plot the running error-rate estimates (cosmetic; remove to speed up the simulation)
   plot(erate);
   drawnow;

end

erate = sort(erate);

fprintf(1, '  Biased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));

% perform Monte-Carlo simulation of unbiased estimator

erate = zeros(NMC,1);

for i=1:NMC

   y = randn(NC,1)  >= 0;
   x = randn(NC,NF) >= 0;

   % perform cross-validation

   partition = mod(1:NC, NFOLD)+1;
   y_xval    = zeros(size(y));

   for j=1:NFOLD

      % perform feature selection using ONLY the training data for this fold

      err       = mean(repmat(y(partition~=j),1,NF) ~= x(partition~=j,:));
      [err,idx] = min(err);

      y_xval(partition==j) = x(partition==j,idx(1));

   end

   erate(i) = mean(y_xval ~= y);

   plot(erate);
   drawnow;

end

erate = sort(erate);

fprintf(1, 'Unbiased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));