Solved – Feature selection using cross validation

cross-validation, feature selection, machine learning

I am dealing with a typical $p > n$ problem in the medical field (typically $p \approx 3700$ and $n \approx 100$). The dependent variable is binary (healthy/sick), and the features are continuous variables representing intensities of a large set of bio-markers. The bio-markers (i.e. features) are extracted from the samples using a feature selection algorithm developed internally. By the nature of the algorithm, the extracted features depend on the data set.

The goal is to extract features that are relevant for the outcome of the dependent variable (healthy/sick).

Process: Since we are dealing with small sample sizes, we suggest using cross-validation (CV) for the feature selection, rather than applying the algorithm to the whole set, as follows (a rough code sketch follows the list):

  1. Split the original data into testing (10%) and training (90%) sets.
  2. Split the training set into 10 folds (10-fold CV).
  3. In each CV step, apply the feature selection algorithm, followed by a non-parametric U-test, and sort by FDR-adjusted p-values.
  4. In each step, select the features with FDR-adjusted p-value $\leq 0.05$.
  5. After CV is finished, collect the features from every CV step into one final list and order them by the number of appearances over all CV steps.
  6. Select the top $k$ (e.g. $k = 10$) features from the final list.
  7. Develop a classifier (e.g. SVM) by training it on the features selected in step 6 using the complete training set.
  8. Test the predictive performance of the selected features + model by measuring accuracy on the testing data from step 1.
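
For concreteness, here is a rough sketch of the intended workflow in Python (scikit-learn/SciPy/statsmodels). `internal_feature_selection` is only a placeholder for our in-house algorithm, and details such as stratification and the two-sided U-test are illustrative, not fixed parts of the proposal:

```python
from collections import Counter

import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.svm import SVC
from statsmodels.stats.multitest import multipletests


def internal_feature_selection(X, y):
    """Placeholder for the internally developed algorithm:
    returns indices of candidate features for this data set."""
    return np.arange(X.shape[1])  # here it simply passes all features on


def run_proposed_workflow(X, y, k=10, alpha=0.05, seed=0):
    # Step 1: 90%/10% training/testing split, stratified on the outcome
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=seed)

    # Steps 2-5: 10-fold CV on the training data, selecting features per step
    counts = Counter()
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for step_idx, _ in cv.split(X_tr, y_tr):   # step_idx = 9/10 of the training data
        Xs, ys = X_tr[step_idx], y_tr[step_idx]
        candidates = internal_feature_selection(Xs, ys)              # step 3a
        pvals = np.array([mannwhitneyu(Xs[ys == 0, j], Xs[ys == 1, j],
                                       alternative='two-sided').pvalue
                          for j in candidates])                      # step 3b: U-test
        reject, _, _, _ = multipletests(pvals, alpha=alpha, method='fdr_bh')
        counts.update(candidates[reject])                            # step 4

    # Step 6: keep the k features appearing most often across CV steps
    top_k = [feat for feat, _ in counts.most_common(k)]

    # Steps 7-8: train an SVM on the selected features, score on the test set
    clf = SVC().fit(X_tr[:, top_k], y_tr)
    return top_k, clf.score(X_te[:, top_k], y_te)
```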

Questions:

  1. Does the above procedure make sense? Is there any reference in the literature for using CV only for feature selection (steps 2–6)?

  2. Following https://www.sciencedirect.com/science/article/pii/S0933365715001426 we do not want to train any classifier during the CV steps. However, if we used the final list of features and fed it back into the CV to train a classifier at every step, would that not also introduce bias?

Thank you.

Edit: To avoid lengthy comments below, the question boils down to: can CV be used purely for feature selection (reduction), as in steps 3–6?

Best Answer

Although one might use CV for feature selection as you propose, that might not be the wisest way to proceed, particularly with your limited data set.

There's already a problem with Steps 1 and 8. A clean test/training separation (Step 1) can be helpful with very large data sets, but there's a serious problem in this case: your completely held-out test set would only be 10 cases, with at most 5 in the smaller of the healthy/sick classes. That's not going to provide, in Step 8, a very rigorous test of the quality of your feature selection. That problem of limited sample size carries over into the CV proposed in Step 2 and following, where you are now down to only 9 held-out cases in each fold. If your smaller class is only at 20-30% prevalence, you will often have no held-out test members in the smaller class in CV folds, limiting its usefulness.

The second problem is the idea that you can use CV "purely for feature selection" without "measuring performance of the process." The "most significant features" are precisely those that provide "a usable model" for describing the population of interest. That assessment must be based on some measure of how well a model based on the selected features represents the data, even if prediction is not your intended future use of the model. Isn't that what you do in effect for every fold of the proposed CV?

In a fundamental sense, what you validate with CV or almost any statistical tool is the combination of the process and the result, not just the result (in your case, the selected features). For example, a p-value in frequentist statistics represents the probability that, if you repeated the experiment (the process) a large number of times, you would get a result at least as extreme as the one you observed simply by chance. And if you want to select 10 features based on your present data sample, wouldn't you want some assurance that the result is likely to be useful with another sample from the same population?

The guiding principle for validating feature selection while minimizing bias is that all steps of the modeling/selection should be repeated for all folds of CV or bootstrapping. This issue is discussed in many threads on this site, with this and this being two threads particularly relevant to your problem.
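
As an illustration of that principle (not of your specific algorithm), here is a minimal scikit-learn sketch in which a stand-in selector (SelectKBest with an F-test) is wrapped in a Pipeline with the classifier, so that selection is re-run from scratch inside every CV fold; the synthetic data and the choice of 10 features are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic data with roughly your dimensions (p = 3700, n = 100)
X, y = make_classification(n_samples=100, n_features=3700, n_informative=10,
                           random_state=0)

# Selection lives inside the pipeline, so it is refit on the training portion
# of every fold; the held-out fold never influences which features are chosen.
selector_plus_model = make_pipeline(SelectKBest(f_classif, k=10), SVC())

scores = cross_val_score(selector_plus_model, X, y,
                         cv=StratifiedKFold(n_splits=10, shuffle=True,
                                            random_state=0))
print(scores.mean())
```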

You could apply this principle to your problem, but with the limited data available a change of perspective might help: accept that some bias resulting from feature selection is inevitable, but do your best to estimate its magnitude. Find the best set of features based on the entire data set, and then use CV or, perhaps better, bootstrapping* to estimate generalizability and bias. Your selected features will then be based directly on all the information you have available, and you can document to your audience how well your overall modeling approach appears to work.

One more warning: with 100 cases you have no more than 50 in the smaller class. A general rule of thumb for classification is that you are in danger of overfitting/bias if you choose more unpenalized predictors than 1/10 to 1/20 of the number of cases in the smaller class. So if you use more than about 5 predictors in your case and don't penalize them in some way, you may be in danger of overfitting. You might consider treating the number of features to select as something to be chosen by CV, as is done with LASSO.
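
A sketch of that last idea, assuming an L1-penalized logistic regression as the LASSO-style model and letting CV pick the penalty strength (and hence, indirectly, how many features survive); again the data are synthetic and the settings purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=3700, n_informative=10,
                           random_state=0)

# CV over a grid of 20 penalty strengths; stronger penalties zero out more
# coefficients, so the number of retained features is chosen indirectly by CV.
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(penalty='l1', solver='liblinear', Cs=20, cv=10,
                         scoring='neg_log_loss', random_state=0))
model.fit(X, y)

coefs = model.named_steps['logisticregressioncv'].coef_.ravel()
print('features kept by the penalty:', np.flatnonzero(coefs).size)
```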


*Bootstrapping is useful for estimating how well the combination of process and result would generalize to the population as a whole. The principle is that each bootstrapped sample is related to the data sample at hand as the data sample at hand is related to the whole population. From each of many bootstrapped samples, perform your feature selection algorithm and evaluate the resulting model's performance on the full data sample in terms of accuracy, bias, or whatever quality measures you use. Pooling those quality estimates across the bootstrapped samples gives an estimate of how well your feature selection based on the data sample represents the entire population.
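
For what it's worth, a minimal sketch of that bootstrap validation, where SelectKBest + SVC stand in for your actual selection-plus-modeling procedure (an assumption on my part) and the data are synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=3700, n_informative=10,
                           random_state=0)

rng = np.random.default_rng(0)
scores = []
for _ in range(200):                              # number of bootstrap samples
    idx = rng.integers(0, len(y), size=len(y))    # resample with replacement
    # Repeat the whole selection + modeling process on the bootstrapped sample...
    model = make_pipeline(SelectKBest(f_classif, k=10), SVC())
    model.fit(X[idx], y[idx])
    # ...and evaluate that model against the full data sample
    scores.append(model.score(X, y))

# Pooled quality estimate across bootstrap samples
print(np.mean(scores))
```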