Solved – Is using the same data for feature selection and cross-validation biased or not

cross-validation, feature selection, machine learning, train

We have a small dataset (about 250 samples × 100 features) on which we want to build a binary classifier after selecting the best feature subset. Let's say that we partition the data into:

Training, Validation and Testing
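A minimal sketch of how such a three-way split could be set up, assuming a feature matrix `X` (~250 × 100) and binary labels `y`; the placeholder data, proportions, and random seeds are illustrative, not part of the original setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(250, 100)                   # placeholder data, ~250 samples x 100 features
y = (rng.rand(250) < 0.2).astype(int)     # roughly 8:2 class imbalance

# First carve off the small test set (~20 samples), then split the remainder
# into training and validation; stratify to preserve the class ratio.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.08, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)
```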

For feature selection, we apply a wrapper approach that selects the features optimizing the performance of classifiers X, Y and Z, separately. In this pre-processing step, we use the training data to train the classifiers and the validation data to evaluate every candidate feature subset.
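One way such a wrapper could look is a greedy forward selection that fits on the training part and scores each candidate subset on the validation part. This is only a sketch under the split assumed above; `forward_select` is a hypothetical helper, and logistic regression stands in for any of the classifiers X, Y or Z:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

def forward_select(clf, X_tr, y_tr, X_va, y_va, max_features=10):
    """Greedy wrapper selection: add the feature that most improves
    validation performance, stop when no candidate helps."""
    selected, remaining = [], list(range(X_tr.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = []
        for j in remaining:
            cols = selected + [j]
            clf.fit(X_tr[:, cols], y_tr)              # train on the training part
            pred = clf.predict(X_va[:, cols])         # evaluate on the validation part
            scores.append((balanced_accuracy_score(y_va, pred), j))
        score, j_best = max(scores)
        if score <= best_score:
            break
        best_score = score
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

features_X = forward_select(LogisticRegression(max_iter=1000),
                            X_train, y_train, X_val, y_val)
```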

At the end, we want to compare the different classifiers (X, Y and Z). Of course, we could use the testing part of the data for a fair comparison and evaluation. However, in my case the testing data would be really small (around 10 to 20 samples), so I want to apply cross-validation to evaluate the models.

The distribution of positive and negative examples is highly imbalanced (about 8:2), so cross-validation could mislead us when evaluating performance. To overcome this, we plan to use the testing portion (10-20 samples) as a second comparison method and to validate the cross-validation results.
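Stratified folds keep the roughly 8:2 class ratio in every split, which makes the cross-validation estimate less erratic on a small, imbalanced sample. A minimal sketch of the setup the question describes (cross-validation over the same training+validation data, restricted to the previously selected features), reusing names from the sketches above:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Note: this evaluates on the same data that was used for feature selection,
# which is exactly the potential bias the question is asking about.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_rest[:, features_X], y_rest,
                         cv=cv, scoring="balanced_accuracy")
print(scores.mean(), scores.std())
```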

In summary, we partition the data into training, validation and testing. The training and validation parts are used for feature selection. Then, cross-validation over the same data is applied to evaluate the models. Finally, the testing set is used to validate the cross-validation results, given the imbalance of the data.

The question is: if the features were selected to optimize the performance of classifiers X, Y and Z on the training+validation data, can we apply cross-validation over that same training+validation data to measure the final performance and compare the classifiers?

I do not know whether this setting leads to a biased cross-validation measure and thus an unjustified comparison.

Best Answer

I think it is biased. What about applying feature selection on N-1 partitions and testing on the last partition, then combining the features from all folds in some way (union, intersection, or some problem-specific way)?
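A rough sketch of that suggestion, under the same assumptions and hypothetical helper (`forward_select`) as the earlier sketches: the wrapper selection runs inside each fold on the N-1 training partitions, the held-out partition scores it, and the per-fold feature sets are combined afterwards by intersection (or union):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores, fold_features = [], []
for tr_idx, te_idx in cv.split(X_rest, y_rest):
    # Inner split of the N-1 training partitions into train/validation
    # so the wrapper never sees the held-out fold.
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_rest[tr_idx], y_rest[tr_idx], test_size=0.25,
        stratify=y_rest[tr_idx], random_state=0)
    feats = forward_select(LogisticRegression(max_iter=1000),
                           X_tr, y_tr, X_va, y_va)
    clf = LogisticRegression(max_iter=1000).fit(X_rest[tr_idx][:, feats],
                                                y_rest[tr_idx])
    pred = clf.predict(X_rest[te_idx][:, feats])
    fold_scores.append(balanced_accuracy_score(y_rest[te_idx], pred))
    fold_features.append(set(feats))

# Combine per-fold selections; intersection keeps only stable features,
# union keeps everything that was ever picked.
stable_features = set.intersection(*fold_features)
print(np.mean(fold_scores), sorted(stable_features))
```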