Solved – How to use 10-fold Cross Validation in Feature Selection

cross-validation, feature-selection

I would like to run 10-fold cross-validation on a number of different feature selection tools. For some tools you can specify the number of folds directly in the Python module (e.g., LassoLarsCV(cv=10)), but for others it is not clear how to implement the cross-validation.
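For the case where the tool does expose the number of folds, a minimal sketch might look like the following (assuming scikit-learn's LassoLarsCV on a synthetic regression problem; the non-zero coefficients are taken as the selected features):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV

# Synthetic data, purely for illustration
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=0.5, random_state=0)

# 10-fold cross-validation is used internally to choose the penalty strength
model = LassoLarsCV(cv=10).fit(X, y)

# Features with non-zero coefficients are the ones the Lasso kept
selected = [i for i, coef in enumerate(model.coef_) if coef != 0]
print(selected)
```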

Let's assume I divide my data into 10 random splits and run the feature selection in each fold. Doing so, there will be some set of variables in each fold (many of the same ones, plus some new ones). How do you cross-validate these nominal outcomes? They are not means or anything we can average over the 10 folds; all we have is a different set of variables from each fold. In other words, how can I validate the ideal set of variables in a cross-validation procedure? By taking the features that are consistently found in each fold?

Best Answer

When you perform $k$-fold cross-validation, you split the data randomly into $k$ equal parts. Then, for each split:

  1. Take the $i^{\text{th}}$ split as the validation set and combine the remaining $k-1$ splits
  2. Train on the $k-1$ combined splits, test on the validation set

Do this for $i = 1, \ldots, k$ and record the average error. Repeat all of these steps for each candidate set of features, then choose the set that gave you the lowest average error. Note that an exhaustive search requires going through $2^n$ combinations, where $n$ is the total number of features. If you can assume independence among the features, you can instead select them in a greedy fashion: start by choosing a single feature, the one among the $n$ that gives the lowest error; keeping it fixed, add one more from the remaining $n-1$, and so on, until the error either stops decreasing or the decrease is too small to offset the cost of enlarging your feature space. A sketch of this greedy procedure is given below.
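Here is a minimal sketch of the greedy (forward) variant, assuming scikit-learn's cross_val_score with 10-fold cross-validation; the synthetic data and the choice of LinearRegression as the estimator are placeholders for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, purely for illustration
X, y = make_regression(n_samples=200, n_features=15, n_informative=4,
                       noise=1.0, random_state=0)

def cv_error(feature_idx):
    """Average 10-fold CV mean squared error for a feature subset (lower is better)."""
    scores = cross_val_score(LinearRegression(), X[:, feature_idx], y,
                             cv=10, scoring="neg_mean_squared_error")
    return -scores.mean()

selected, remaining = [], list(range(X.shape[1]))
best_error = np.inf

while remaining:
    # Try adding each remaining feature and keep the one that helps most
    errors = {f: cv_error(selected + [f]) for f in remaining}
    best_feature = min(errors, key=errors.get)
    if errors[best_feature] >= best_error:
        break  # no further improvement: stop
    best_error = errors[best_feature]
    selected.append(best_feature)
    remaining.remove(best_feature)

print("Selected features:", selected, "CV error:", best_error)
```

Any estimator and scoring rule could be substituted here; the stopping rule is simply "stop when the cross-validated error no longer decreases", matching the criterion described above.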
