Solved – How to divide feature set for selection and training

feature selection, machine learning, sample, svm

I have training data with 260 observations spanning a total of 7 classes; each observation has 120 features. I applied feature selection based on the Bhattacharyya distance and got the top 40 features for each class.

I have two questions. First, I performed feature selection on the whole set of observations, and I plan to train on one half of the data set (50%) and test on the held-out half. Is this method okay, or should feature selection also be performed only on the training data?

Second, once I have the top 40 features for each class, how do I give an SVM the selected feature sets when, say, features 1, 2, and 5 are important for class 1 and features 1, 2, and 6 are important for class 2? I am using MATLAB as my implementation tool. Thanks in advance.

Best Answer

The purpose of splitting your data into training and test sets is to simulate the real world: you have a batch of labeled data, you train a classifier on part of it, and you want to see how it performs on data that was not used to train it. By using the labels from the test set to perform feature selection, you've used information that would not be available at prediction time (you're pretending you don't know the test labels), and you've biased your performance estimates.
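The correct order of operations can be sketched in plain Python. This is a toy illustration, not the asker's MATLAB code: the data is random, and a simple spread-of-class-means score stands in for the Bhattacharyya criterion. Only the split-first discipline is the point.

```python
import random

random.seed(0)

# Toy data shaped like the post: 260 observations, 120 features, 7 classes.
# (Values are random; this only illustrates the order of operations.)
n_obs, n_feat, n_classes = 260, 120, 7
X = [[random.gauss(0, 1) for _ in range(n_feat)] for _ in range(n_obs)]
y = [i % n_classes for i in range(n_obs)]

# 1) Split FIRST: 50% train, 50% test, as in the post.
idx = list(range(n_obs))
random.shuffle(idx)
train_idx, test_idx = idx[:n_obs // 2], idx[n_obs // 2:]

# 2) Score each feature using ONLY the training rows. The spread of
#    per-class means is a crude stand-in for the Bhattacharyya distance.
def feature_score(j):
    class_means = []
    for c in range(n_classes):
        vals = [X[i][j] for i in train_idx if y[i] == c]
        class_means.append(sum(vals) / len(vals))
    return max(class_means) - min(class_means)

scores = sorted(((feature_score(j), j) for j in range(n_feat)), reverse=True)
top40 = sorted(j for _, j in scores[:40])

# 3) Reduce the test rows with the SAME feature list chosen on the training
#    half; the test labels never influenced the selection.
X_train = [[X[i][j] for j in top40] for i in train_idx]
X_test = [[X[i][j] for j in top40] for i in test_idx]

print(len(top40), len(X_train), len(X_test))  # 40 130 130
```

The test-set performance you then measure is an honest estimate, because everything learned (the feature list and the classifier) came from the training half alone.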

Here's an extreme example. Suppose I use the following feature learning algorithm: map each data point to an indicator of its own label, and do this using all of the data. Now, for any reasonable split of my data, any reasonable classifier is going to figure this relationship out and get 100% accuracy on the test set. Granted, my algorithm does feature learning rather than feature selection, but I think you'll see the point I'm trying to make.
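The extreme example above can be made concrete with a few lines of stdlib Python (the data and the trivial lookup "classifier" are mine, added for illustration):

```python
# 130 observations with labels 0..6. The single "feature" of each point is
# deliberately just a copy of its own label, computed on ALL the data
# before any train/test split -- this is the leak.
y = [i % 7 for i in range(130)]
X = [[label] for label in y]

# Even/odd split into train and test halves (65 each). Every label value
# still appears in the training half.
train_idx = list(range(0, 130, 2))
test_idx = list(range(1, 130, 2))

# "Train" a trivial classifier: memorise feature -> label on the train half.
lookup = {X[i][0]: y[i] for i in train_idx}

# It scores 100% on the test half -- not because it learned anything, but
# because the test labels leaked into the features before the split.
correct = sum(lookup[X[i][0]] == y[i] for i in test_idx)
accuracy = correct / len(test_idx)
print(accuracy)  # 1.0
```

The same mechanism operates, more subtly, whenever test labels influence which features are kept: the measured accuracy is optimistic, even though no single step looks obviously wrong.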