Solved – Should feature selection be performed only on training data (or all data)

cross-validation, dataset, experiment-design, feature-selection

Should feature selection be performed only on training data (or on all data)? I went through some discussions and papers, such as Guyon (2003) and Singhi and Liu (2006), but I am still not sure about the right answer.

My experiment setup is as follows:

  • Dataset: 50 healthy controls and 50 disease patients (approximately 200 features that may be relevant to disease prediction).
  • The task is to diagnose the disease based on the available features.

What I do is the following (a code sketch of this setup follows the list):

  1. Take the whole dataset and perform feature selection (FS). Keep only the selected features for further processing.
  2. Split into training and test sets, train the classifier on the training data using the selected features, then apply the classifier to the test data (again using only the selected features). Leave-one-out validation is used.
  3. Obtain the classification accuracy.
  4. Averaging: repeat steps 1–3 N times, with $N=50$ (or 100).
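
For concreteness, here is a minimal sketch of this protocol, assuming scikit-learn and placeholder choices (SelectKBest as the feature selector, a linear SVM as the classifier, and synthetic data standing in for the real dataset), since the question does not specify which methods are used:

```python
# Sketch of the protocol described above (placeholder selector/classifier/data).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))   # 100 subjects, ~200 features (synthetic stand-in)
y = np.repeat([0, 1], 50)         # 50 controls, 50 patients

# Step 1: feature selection on the WHOLE dataset (this is the contested step)
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)

# Steps 2-3: leave-one-out validation on the already-reduced data
acc = cross_val_score(LinearSVC(), X_selected, y, cv=LeaveOneOut()).mean()
print(f"LOO accuracy (selection performed on all data): {acc:.3f}")
# Step 4 (repeating and averaging) is omitted here for brevity.
```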

I would agree that doing FS on the whole dataset can introduce some bias, but my opinion is that it is "averaged out" during the averaging (step 4). Is that correct? (The accuracy variance is $<2\%$.)

1. Guyon, I. (2003) "An Introduction to Variable and Feature Selection", Journal of Machine Learning Research, vol. 3, pp. 1157–1182.
2. Singhi, S. K. and Liu, H. (2006) "Feature Subset Selection Bias for Classification Learning", Proceedings of the 23rd International Conference on Machine Learning (ICML '06), pp. 849–856.

Best Answer

The procedure you are using will result in optimistically biased performance estimates, because the data from the test set used in steps 2 and 3 are also used to decide which features are selected in step 1. Repeating the exercise reduces the variance of the performance estimate, not its bias, so the bias does not average out. To get an unbiased performance estimate, the test data must not be used in any way to make choices about the model, including feature selection.

A better approach is to use nested cross-validation: the outer cross-validation provides an estimate of the performance obtainable with a given method of constructing the model (including feature selection), and the inner cross-validation is used to select the features independently within each fold of the outer cross-validation. Then build your final predictive model using all the data.
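
As an illustration, here is a minimal sketch of nested cross-validation, assuming scikit-learn; the selector, classifier, grid of k values and synthetic data are all placeholder assumptions. The essential point is that the feature selection step sits inside the pipeline, so it is re-fitted using only the training portion of every fold:

```python
# Nested cross-validation with feature selection inside the pipeline.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))   # synthetic stand-in for the real data
y = np.repeat([0, 1], 50)

pipe = Pipeline([("select", SelectKBest(f_classif)),
                 ("clf", LinearSVC())])

# Inner CV: choose how many features to keep, using only the training fold
inner = GridSearchCV(pipe,
                     param_grid={"select__k": [5, 10, 20, 50]},
                     cv=StratifiedKFold(5))

# Outer CV: estimates the performance of the WHOLE procedure
# (feature selection + tuning + fitting), not of one fixed feature subset
outer_scores = cross_val_score(inner, X, y, cv=StratifiedKFold(5))
print(f"Nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")

# Final model: refit the whole procedure on all the data
final_model = inner.fit(X, y)
```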

As you have more features than cases, you are very likely to over-fit the data simply by performing feature selection. It is something of a myth that feature selection should be expected to improve predictive performance, so if predictive performance is what you are interested in (rather than identifying the relevant features as an end in itself), you are probably better off using ridge regression and not performing any feature selection. This will probably give better predictive performance than feature selection, provided the ridge parameter is selected carefully (I use minimisation of Allen's PRESS statistic, i.e. the leave-one-out estimate of the mean squared error).
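
A minimal sketch of that alternative is below, again assuming scikit-learn: the labels are coded as ±1 and RidgeCV is fitted with its default cv=None, which uses an efficient leave-one-out scheme, so with the default squared-error criterion the ridge parameter is chosen by minimising the leave-one-out squared error (Allen's PRESS). The alpha grid and synthetic data are placeholders.

```python
# Ridge regression on +/-1 coded labels, ridge parameter chosen by LOO/PRESS.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))           # synthetic stand-in for the real data
y = np.repeat([-1.0, 1.0], 50)            # controls = -1, patients = +1

# cv=None (the default) -> efficient leave-one-out selection of alpha
model = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X, y)
print("chosen ridge parameter:", model.alpha_)

# Classify a subject by the sign of the regression output.
# Note: generalisation performance should still be estimated with an
# outer cross-validation, as discussed above; this in-sample prediction
# only shows how the fitted model is used.
y_pred = np.sign(model.predict(X))
```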

For further details, see Ambroise and McLachlan, and my answer to this question.