If you perform feature selection on all of the data and then cross-validate, the test data in each fold of the cross-validation procedure was also used to choose the features, and this is what biases the performance analysis.
Consider this example. We generate some target data by flipping a coin 10 times and recording whether it comes down heads or tails. Next, we generate 20 features by flipping the coin 10 times for each feature and writing down what we get. We then perform feature selection by picking the feature that matches the target data as closely as possible and use that feature as our prediction. If we then cross-validate, we will get an expected error rate lower than 0.5 (how much lower depends on the number of features and cases), because the feature was chosen on the basis of its correlation over both the training set and the test set in every fold of the cross-validation procedure. However, the true error rate is going to be 0.5, as the target data is simply random. If instead you perform feature selection independently within each fold of the cross-validation, the expected value of the error rate is 0.5 (which is correct).
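Here is a minimal MATLAB sketch of that toy setup (10 flips, 20 candidate features; the variable names are mine), showing that the feature selected on all of the data will typically have an apparent error well below 0.5 even though every feature is pure noise:

rng(0);                               % for reproducibility
y = randn(10,1) >= 0;                 % target: 10 coin flips
x = randn(10,20) >= 0;                % 20 candidate features, 10 coin flips each
err = mean(repmat(y,1,20) ~= x);      % error of each feature against the target
[best_err, idx] = min(err);           % pick the feature that matches the target best
fprintf('apparent error of selected feature %d: %f\n', idx, best_err);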
The key idea is that cross-validation is a way of estimating the generalization performance of a process for building a model, so you need to repeat the whole process in each fold. Otherwise, you will end up with a biased estimate, or an under-estimate of the variance of the estimate (or both).
HTH
Here is some MATLAB code that performs a Monte-Carlo simulation of this setup, with 56 features and 259 cases to match your example; the output it gives is:
Biased estimator: erate = 0.429210 (0.397683 - 0.451737)
Unbiased estimator: erate = 0.499689 (0.397683 - 0.590734)
The biased estimator is the one where feature selection is performed prior to cross-validation; the unbiased estimator is the one where feature selection is performed independently within each fold of the cross-validation. This suggests that the bias can be quite severe, depending on the nature of the learning task.
NF = 56;       % number of candidate features
NC = 259;      % number of cases
NFOLD = 10;    % number of cross-validation folds
NMC = 1e+4;    % number of Monte-Carlo replications

% perform Monte-Carlo simulation of biased estimator
erate = zeros(NMC,1);
for i=1:NMC
    % generate a random binary target and random binary features (pure noise)
    y = randn(NC,1) >= 0;
    x = randn(NC,NF) >= 0;
    % perform feature selection using ALL of the data
    err = mean(repmat(y,1,NF) ~= x);
    [err,idx] = min(err);
    % perform cross-validation using the already-selected feature
    partition = mod(1:NC, NFOLD)+1;
    y_xval = zeros(size(y));
    for j=1:NFOLD
        y_xval(partition==j) = x(partition==j,idx(1));
    end
    erate(i) = mean(y_xval ~= y);
    plot(erate);
    drawnow;
end
erate = sort(erate);
fprintf(1, ' Biased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
% perform Monte-Carlo simulation of unbiased estimator
erate = zeros(NMC,1);
for i=1:NMC
    % generate a random binary target and random binary features (pure noise)
    y = randn(NC,1) >= 0;
    x = randn(NC,NF) >= 0;
    % perform cross-validation
    partition = mod(1:NC, NFOLD)+1;
    y_xval = zeros(size(y));
    for j=1:NFOLD
        % perform feature selection using only the training data for this fold
        err = mean(repmat(y(partition~=j),1,NF) ~= x(partition~=j,:));
        [err,idx] = min(err);
        y_xval(partition==j) = x(partition==j,idx(1));
    end
    erate(i) = mean(y_xval ~= y);
    plot(erate);
    drawnow;
end
erate = sort(erate);
fprintf(1, 'Unbiased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
When you use decision stumps as your weak classifier, AdaBoost does feature selection explicitly: each stump splits on a single feature, so the features picked across the boosting rounds form the selected subset. Other weak classifiers may not let you read off the selected features as easily.
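As a rough illustration (assuming the Statistics and Machine Learning Toolbox; the toy data and variable names here are my own), you can boost stumps and read off which features the stumps actually used:

% toy data: 100 cases, 5 features, only feature 3 carries signal
rng(0);
X = randn(100,5);
Y = X(:,3) + 0.5*randn(100,1) >= 0;
stump = templateTree('MaxNumSplits', 1);        % a decision stump
ens = fitcensemble(X, Y, 'Method', 'AdaBoostM1', ...
    'NumLearningCycles', 50, 'Learners', stump);
imp = predictorImportance(ens);                 % one importance value per feature
selected = find(imp > 0)                        % features actually used by the stumps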
I think you are over-complicating your training/testing protocol. Here is the most common scenario: A: training, B: validation, C: testing. You train on A and adjust the parameters of your method (in AdaBoost's case, the number of weak classifiers to use) to maximize performance on B. After you have selected the optimal parameters, you train on A+B and test on C. A better way still is to do k-fold cross-validation on A+B instead of relying on a single held-out B.
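A minimal MATLAB sketch of that A/B/C protocol, again with AdaBoost on stumps (the split proportions, candidate parameter values, and toy data are my own assumptions, and the Statistics and Machine Learning Toolbox is assumed):

% toy data (stand-in for your features and labels)
rng(0);
X = randn(500,10);
Y = X(:,1) + X(:,2) + 0.5*randn(500,1) >= 0;
% random 50/25/25 split into A (train), B (validation), C (test)
r = rand(500,1);
A = r < 0.50;  B = r >= 0.50 & r < 0.75;  C = r >= 0.75;
stump = templateTree('MaxNumSplits', 1);
% choose the number of weak classifiers by validation performance on B
best_T = NaN;  best_err = Inf;
for T = [10 25 50 100 200]
    ens = fitcensemble(X(A,:), Y(A), 'Method', 'AdaBoostM1', ...
        'NumLearningCycles', T, 'Learners', stump);
    err = mean(predict(ens, X(B,:)) ~= Y(B));
    if err < best_err, best_err = err; best_T = T; end
end
% retrain on A+B with the chosen parameter and report performance on C
ens = fitcensemble(X(A|B,:), Y(A|B), 'Method', 'AdaBoostM1', ...
    'NumLearningCycles', best_T, 'Learners', stump);
fprintf('test error on C: %f (using %d weak classifiers)\n', ...
    mean(predict(ens, X(C,:)) ~= Y(C)), best_T);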
Best Answer
The purpose of splitting your data into training and test sets is to simulate the real world: you have a bunch of labeled data to train a classifier on, and you want to see how it performs on data that was not used to train it. By using the labels from the test set to perform feature selection, you have used information that would not actually be available (you are pretending you don't know the labels on the test set), and you have biased your estimates.
Here's an extreme example. Suppose I use the following feature learning algorithm: map each data point to a binary indicator for the corresponding label. I do this using all the data. Now for any reasonable split of my data, any reasonable classifier is going to be able to figure this relationship out, and get 100% accuracy on the test set. Granted, my algorithm is for feature learning, not feature selection, but I think you'll see the point I'm trying to make.
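For concreteness, here is a small MATLAB sketch of that extreme case (the toy data and names are mine, and a simple threshold stands in for "any reasonable classifier"):

rng(1);
n = 100;
y = randn(n,1) >= 0;                  % random labels: nothing is actually learnable
x_leaky = double(y);                  % "feature learning" on ALL the data: the feature is the label
train = false(n,1); train(1:round(0.7*n)) = true;   % a 70/30 train/test split made afterwards
test = ~train;
yhat = x_leaky(test) >= 0.5;          % trivial classifier: threshold the leaky feature
fprintf('test accuracy with the leaky feature: %f\n', mean(yhat == y(test)));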