Final Model Feature Selection During Cross-Validation in Machine Learning

classification, cross-validation, feature selection, genetics, machine learning

I am getting a bit confused about feature selection in machine learning and I was wondering if you could help me out. I have a microarray dataset that is classified into two groups and has thousands of features. My aim is to get a small signature of genes (my features; 10-20 of them) that I will, in theory, be able to apply to other datasets to optimally classify those samples. As I do not have many samples (<100), I am not using separate training and test sets, but am using leave-one-out cross-validation (LOOCV) to help determine robustness. I have read that one should perform feature selection for each split of the samples, i.e.:

  1. Select one sample as the test set
  2. On the remaining samples perform feature selection
  3. Apply machine learning algorithm to remaining samples using the features selected
  4. Test whether the test set is correctly classified
  5. Go to 1.
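
Concretely, I understand the loop to look something like the following sketch (the univariate F-test filter, the linear SVM, and `k=15` are just placeholders for whatever selector and classifier one actually uses):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loocv_with_inner_selection(X, y, k=15):
    """LOO-CV where feature selection is repeated inside every split."""
    correct = 0
    selected_per_fold = []                      # which genes each fold picked
    for train_idx, test_idx in LeaveOneOut().split(X):
        # Step 2: select features on the training samples only
        selector = SelectKBest(f_classif, k=k).fit(X[train_idx], y[train_idx])
        # Step 3: fit the classifier on those selected features
        clf = SVC(kernel="linear").fit(selector.transform(X[train_idx]),
                                       y[train_idx])
        # Step 4: test whether the held-out sample is classified correctly
        correct += (clf.predict(selector.transform(X[test_idx]))[0]
                    == y[test_idx][0])
        selected_per_fold.append(selector.get_support(indices=True))
    return correct / len(y), selected_per_fold  # LOO accuracy + per-fold genes
```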

If you do this, you might get different genes each time, so how do you arrive at your "final" optimal gene classifier? In other words, what is step 6?

By optimal I mean the collection of genes that any further studies should use. For example, say I have a cancer/normal dataset and I want to find the top 10 genes that will classify the tumour type with an SVM. I would like to know the set of genes, plus the SVM parameters, that could be used in further experiments to see whether the signature works as a diagnostic test.

Best Answer

Whether you use LOO or K-fold CV, you'll end up with different features in each fold, since the cross-validation iteration must be the outermost loop, as you said. One option is a voting scheme that rates the n feature vectors you got from your LOO-CV (I can't remember the exact paper, but it is worth checking the work of Harald Binder or Antoine Cornuéjols). In the absence of a new test sample, what is usually done is to re-apply the ML algorithm to the whole sample once you have found its optimal cross-validated parameters. Proceeding this way, however, you cannot rule out overfitting, since the sample was already used for model selection.
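
As a deliberately simple version of such a voting scheme, assuming you kept the per-fold feature lists from a loop like the one sketched in the question, you could count how often each gene was selected, keep the most frequent ones, and refit on the whole sample (bearing in mind the caveat above about the refit's apparent accuracy):

```python
from collections import Counter
from sklearn.svm import SVC

def final_signature(X, y, selected_per_fold, n_genes=15):
    # Voting: rank genes by how many LOO folds selected them
    votes = Counter(idx for fold in selected_per_fold for idx in fold)
    signature = [idx for idx, _ in votes.most_common(n_genes)]
    # Refit on the whole sample, restricted to the winning genes.
    # Note: any accuracy computed on this same sample is optimistic,
    # because the data already drove the feature selection.
    clf = SVC(kernel="linear").fit(X[:, signature], y)
    return signature, clf
```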

Alternatively, you can use embedded methods, which provide a feature ranking through a measure of variable importance, e.g. Random Forests (RF). Because RFs come with a built-in out-of-bag error estimate, you don't have to worry as much about the $n\ll p$ case or the curse of dimensionality. Here are some nice papers on their application in gene expression studies (a short RF sketch follows the references):

  1. Cutler, A., Cutler, D.R., and Stevens, J.R. (2009). Tree-Based Methods, in High-Dimensional Data Analysis in Cancer Research, Li, X. and Xu, R. (eds.), pp. 83-101, Springer.
  2. Saeys, Y., Inza, I., and Larrañaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19): 2507-2517.
  3. Díaz-Uriarte, R. and Alvarez de Andrés, S. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7: 3.
  4. Díaz-Uriarte, R. (2007). GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest. BMC Bioinformatics, 8: 328.
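
By way of illustration, a minimal RF sketch in scikit-learn (not the varSelRF package itself; `X` and `y` are assumed to hold your expression matrix and class labels) might look like:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit a forest with out-of-bag scoring as the internal error estimate
rf = RandomForestClassifier(n_estimators=2000, oob_score=True,
                            random_state=0).fit(X, y)
print(f"OOB accuracy: {rf.oob_score_:.3f}")

# Rank genes by impurity-based variable importance; keep the top 20
top_genes = np.argsort(rf.feature_importances_)[::-1][:20]
```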

Since you are talking about SVMs, you can also look into penalized SVMs, which perform feature selection as part of model fitting.
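
For instance, an L1-penalized linear SVM yields a sparse weight vector, so the features it keeps are exactly the genes with non-zero weights. A rough sketch (the value of `C` is an assumption you would tune by nested cross-validation):

```python
import numpy as np
from sklearn.svm import LinearSVC

# The L1 penalty drives most gene weights exactly to zero (embedded
# selection); LinearSVC requires dual=False when penalty="l1"
l1_svm = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=10000).fit(X, y)
selected_genes = np.flatnonzero(l1_svm.coef_)  # indices of non-zero weights
```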