Solved – How to use 10-fold Cross Validation in Feature Selection

cross-validationfeature selection

I would like to run a 10-fold cross validation on a number of different feature selection tools. For some tools, you can specify k-fold in the Python module (i.e., LassoLarsCV(cv=10)), but others it is not clear how to implement the cross-validation.

Let's assume, I divided my data into 10 random splits and run the feature selection in each fold. Doing so, there will be some set of variables (many same ones as well as new ones) in each fold. How do you cross-validate these nominal outcomes? They are not means or anything so we can take the average of the 10 fold, but all we have is a different number of variables as a result of validation in each fold. In other words, how can I validate the ideal set of variables in cross-validation procedure? Taking the features that are consistently found in each fold?

Best Answer

When you perform $k$-fold cross validation, you split the data equally and randomly into $k$ splits. Now you,

Take $i^{th}$ split as validation set, and combine the rest $k-1$ splits
Train on the $k-1$ splits combined, test on the validation set

Do this for $i = 1,..., k$ and note the average error. Repeat all these steps for each potential set of features, and then choose the set that gave you the lowest average error. Note that this requires you to go through $2^n$ combinations, where $n$ is the total number of features. If you can assume independence among the features, you can select them in a greedy fashion: you start with choosing just one feature. See which one among the $n$ features gives you the lowest error, and then keeping that constant, add one more from the remaining, do this $n-1$ times for the $n-1$ remaining features, and so on, until the error either doesn't decrease anymore, or the decrease is too low to offset the cost of increasing your feature space.

Related Solutions

Feature Selection and Cross-Validation – Techniques and Best Practices

If you perform feature selection on all of the data and then cross-validate, then the test data in each fold of the cross-validation procedure was also used to choose the features and this is what biases the performance analysis.

Consider this example. We generate some target data by flipping a coin 10 times and recording whether it comes down as heads or tails. Next, we generate 20 features by flipping the coin 10 times for each feature and write down what we get. We then perform feature selection by picking the feature that matches the target data as closely as possible and use that as our prediction. If we then cross-validate, we will get an expected error rate slightly lower than 0.5. This is because we have chosen the feature on the basis of a correlation over both the training set and the test set in every fold of the cross-validation procedure. However, the true error rate is going to be 0.5 as the target data is simply random. If you perform feature selection independently within each fold of the cross-validation, the expected value of the error rate is 0.5 (which is correct).

The key idea is that cross-validation is a way of estimating the generalization performance of a process for building a model, so you need to repeat the whole process in each fold. Otherwise, you will end up with a biased estimate, or an under-estimate of the variance of the estimate (or both).

HTH

Here is some MATLAB code that performs a Monte-Carlo simulation of this setup, with 56 features and 259 cases, to match your example, the output it gives is:

Biased estimator: erate = 0.429210 (0.397683 - 0.451737)

Unbiased estimator: erate = 0.499689 (0.397683 - 0.590734)

The biased estimator is the one where feature selection is performed prior to cross-validation, the unbiased estimator is the one where feature selection is performed independently in each fold of the cross-validation. This suggests that the bias can be quite severe in this case, depending on the nature of the learning task.

NF    = 56;
NC    = 259;
NFOLD = 10;
NMC   = 1e+4;

% perform Monte-Carlo simulation of biased estimator

erate = zeros(NMC,1);

for i=1:NMC

   y = randn(NC,1)  >= 0;
   x = randn(NC,NF) >= 0;

   % perform feature selection

   err       = mean(repmat(y,1,NF) ~= x);
   [err,idx] = min(err);

   % perform cross-validation

   partition = mod(1:NC, NFOLD)+1;
   y_xval    = zeros(size(y));

   for j=1:NFOLD

      y_xval(partition==j) = x(partition==j,idx(1));

   end

   erate(i) = mean(y_xval ~= y);

   plot(erate);
   drawnow;

end

erate = sort(erate);

fprintf(1, '  Biased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));

% perform Monte-Carlo simulation of unbiased estimator

erate = zeros(NMC,1);

for i=1:NMC

   y = randn(NC,1)  >= 0;
   x = randn(NC,NF) >= 0;

   % perform cross-validation

   partition = mod(1:NC, NFOLD)+1;
   y_xval    = zeros(size(y));

   for j=1:NFOLD

      % perform feature selection

      err       = mean(repmat(y(partition~=j),1,NF) ~= x(partition~=j,:));
      [err,idx] = min(err);

      y_xval(partition==j) = x(partition==j,idx(1));

   end

   erate(i) = mean(y_xval ~= y);

   plot(erate);
   drawnow;

end

erate = sort(erate);

fprintf(1, 'Unbiased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));

Solved – Forward search feature selection and cross-validation

Your second procedure assumes you have some other feature selection algorithm (for example, stepwise regression with some stopping rule), distinct from the cross-validation. If you don't have this, you'll just have to use the first procedure (where cross-validation is the whole feature-selection algorithm).

Also, even if the second procedure is applicable, the first procedure might do better. In the second procedure, a greedy feature-selection algorithm might always pick models that are overfit to the training data. Then the CV would only let you choose among these bad models. This shouldn't happen in the first procedure.

On the other hand, if your problem does have a specialized feature-selection algorithm which is computationally-efficient, then the second procedure may run much faster than the first.

If you do use the second procedure, one way to choose a best feature set is to let CV choose the model size. At every model size, you might compare different models on each data split, but average their test errors across all splits. This way, you can use CV to decide which model size gives the best estimated performance. Finally, rerun your feature-selection algorithm on the full dataset, up to the size chosen by CV, and use this as the final feature set.

Best Answer

Related Solutions

Feature Selection and Cross-Validation – Techniques and Best Practices

Solved – Forward search feature selection and cross-validation

Related Question