Solved – Is using the same data for feature selection and cross-validation biased or not

cross-validation, feature selection, machine learning, train

We have a small dataset (about 250 samples × 100 features) on which we want to build a binary classifier after selecting the best feature subset. Let's say that we partition the data into:

Training, Validation and Testing
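A minimal sketch of how such a three-way split could be set up, assuming a feature matrix `X` (~250 × 100) and binary labels `y`; the placeholder data, proportions, and random seeds are illustrative, not part of the original setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(250, 100)                   # placeholder data, ~250 samples x 100 features
y = (rng.rand(250) < 0.2).astype(int)     # roughly 8:2 class imbalance

# First carve off the small test set (~20 samples), then split the remainder
# into training and validation; stratify to preserve the class ratio.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.08, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)
```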

For feature selection, we apply a wrapper approach that selects the features optimizing the performance of classifiers X, Y and Z, separately. In this pre-processing step, we use the training data to train the classifiers and the validation data to evaluate every candidate feature subset.
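One way such a wrapper could look is a greedy forward selection that fits on the training part and scores each candidate subset on the validation part. This is only a sketch under the split assumed above; `forward_select` is a hypothetical helper, and logistic regression stands in for any of the classifiers X, Y or Z:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

def forward_select(clf, X_tr, y_tr, X_va, y_va, max_features=10):
    """Greedy wrapper selection: add the feature that most improves
    validation performance, stop when no candidate helps."""
    selected, remaining = [], list(range(X_tr.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = []
        for j in remaining:
            cols = selected + [j]
            clf.fit(X_tr[:, cols], y_tr)              # train on the training part
            pred = clf.predict(X_va[:, cols])         # evaluate on the validation part
            scores.append((balanced_accuracy_score(y_va, pred), j))
        score, j_best = max(scores)
        if score <= best_score:
            break
        best_score = score
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

features_X = forward_select(LogisticRegression(max_iter=1000),
                            X_train, y_train, X_val, y_val)
```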

At the end, we want to compare the different classifiers (X, Y and Z). Of course, we could use the testing part of the data for a fair comparison and evaluation. However, in my case the testing data would be really small (around 10 to 20 samples), so I want to apply cross-validation to evaluate the models.

The distribution of positive and negative examples is highly imbalanced (about 8:2), so cross-validation could mislead us when evaluating performance. To overcome this, we plan to use the testing portion (10-20 samples) as a second comparison method and to validate the cross-validation results.
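Stratified folds keep the roughly 8:2 class ratio in every split, which makes the cross-validation estimate less erratic on a small, imbalanced sample. A minimal sketch of the setup the question describes (cross-validation over the same training+validation data, restricted to the previously selected features), reusing names from the sketches above:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Note: this evaluates on the same data that was used for feature selection,
# which is exactly the potential bias the question is asking about.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_rest[:, features_X], y_rest,
                         cv=cv, scoring="balanced_accuracy")
print(scores.mean(), scores.std())
```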

In summary, we partition the data into training, validation and testing. The training and validation parts are used for feature selection. Then, cross-validation over the same data is applied to evaluate the models. Finally, the testing set is used to validate the cross-validation results, given the imbalance of the data.

The question is: if the features were selected to optimize the performance of classifiers X, Y and Z on the training+validation data, can we apply cross-validation over that same training+validation data to measure the final performance and compare the classifiers?

I do not know whether this setting leads to a biased cross-validation measure and thus an unjustified comparison.

Best Answer

I think it is biased. What about applying feature selection on N-1 partitions and testing on the last partition, then combining the features from all folds in some way (union, intersection, or some problem-specific way)?
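A rough sketch of that suggestion, under the same assumptions and hypothetical helper (`forward_select`) as the earlier sketches: the wrapper selection runs inside each fold on the N-1 training partitions, the held-out partition scores it, and the per-fold feature sets are combined afterwards by intersection (or union):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores, fold_features = [], []
for tr_idx, te_idx in cv.split(X_rest, y_rest):
    # Inner split of the N-1 training partitions into train/validation
    # so the wrapper never sees the held-out fold.
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_rest[tr_idx], y_rest[tr_idx], test_size=0.25,
        stratify=y_rest[tr_idx], random_state=0)
    feats = forward_select(LogisticRegression(max_iter=1000),
                           X_tr, y_tr, X_va, y_va)
    clf = LogisticRegression(max_iter=1000).fit(X_rest[tr_idx][:, feats],
                                                y_rest[tr_idx])
    pred = clf.predict(X_rest[te_idx][:, feats])
    fold_scores.append(balanced_accuracy_score(y_rest[te_idx], pred))
    fold_features.append(set(feats))

# Combine per-fold selections; intersection keeps only stable features,
# union keeps everything that was ever picked.
stable_features = set.intersection(*fold_features)
print(np.mean(fold_scores), sorted(stable_features))
```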