Solved – feature selection on training and test data

cross-validation, feature-selection, machine-learning, methodology, predictive-models

It is clear that feature selection (FS) has to be done separately on training and then on test data to avoid overly optimistic results. Let's assume that I have a training set and a test set. Also assume that I am using filter FS.

1. I do FS on the training data and use the selected features to train a classifier (e.g. SVM: svm.train(X_train, y)). Assume the top 5 features selected by FS and used for training were A, B, D, F, L (let's set aside parameter tuning for now).

I am not sure what the second step should be. There are two options.

2A. Apply FS to the TEST data. In this case the FS method may select a different top 5, e.g. A, B, C, D, E. Use these features to test the model (e.g. y = svm.predict(X_test)).

2B. From the TEST data, select exactly the same features that FS chose in the training stage (A, B, D, F, L) and use them to test the model (y = svm.predict(X_test)). In this step we apparently do not need to run the FS algorithm at all, since we already know from step 1 which features to select.

Which of these two approaches is correct?
Thanks.
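To make the two options concrete, here is a minimal sketch of option 2B in scikit-learn (an assumed library choice; the synthetic data and `k=5` are illustrative, not from the original post). The key point is that the selector is fitted on the training data only and then merely applied to the test data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical data standing in for the question's X_train / X_test.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the filter FS on the TRAINING data only (step 1).
selector = SelectKBest(f_classif, k=5).fit(X_train, y_train)
clf = SVC().fit(selector.transform(X_train), y_train)

# Option 2B: reuse the SAME columns chosen on the training set.
# selector.transform does not re-run FS; it just picks those columns.
y_pred = clf.predict(selector.transform(X_test))
```

Option 2A would instead call `SelectKBest(...).fit(X_test, y_test)`, which can pick different columns than the classifier was trained on.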

Best Answer

Definitely 2B. Your model was built with features A, B, D, F, L and would not be able to interpret other features appropriately. Additionally, I would recommend looking into nested cross-validation (applying FS when cross-validating).
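When cross-validating, the same logic applies inside every fold: FS must be refitted on each fold's training portion only. One way to get this automatically (a sketch, assuming scikit-learn and the same illustrative data/parameters as above) is to wrap the selector and classifier in a pipeline, so `cross_val_score` refits both on each training fold:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical data; k=5 mirrors the "top 5 features" in the question.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# FS is a pipeline step, so each CV fold refits it on that fold's
# training data only -- no information leaks from the held-out part.
pipe = make_pipeline(SelectKBest(f_classif, k=5), SVC())
scores = cross_val_score(pipe, X, y, cv=5)
```

Running FS once on the full data set and then cross-validating the classifier would leak test-fold information into the selection, which is exactly the optimism the question is trying to avoid.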