Solved – feature selection on training and test data

cross-validation, feature-selection, machine-learning, methodology, predictive-models

It is clear that feature selection (FS) has to be done separately on training and then on test data to avoid overly optimistic results. Let's assume that I have a training set and a test set. Also assume that I am using filter FS.

1. I do FS on the training data and use the selected features to train a classifier (e.g. SVM: svm.train(X_train, y)). Assume the top 5 features selected by FS and used for training were A, B, D, F, L (let's set aside parameter tuning for now).

I am not sure what the second step should be. There are two options.

2A. Apply FS to the TEST data. In this case the FS method may select a different top 5, e.g. A, B, C, D, E. Use these features to test the model (e.g. y = svm.predict(X_test)).

2B. From the TEST data, select exactly the same features that FS chose in the training stage (A, B, D, F, L) and use them to test the model (y = svm.predict(X_test)). In this step we apparently do not need to run the FS algorithm at all, since we already know from step 1 which features to select.

Which of these two approaches is correct?
Thanks.
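To make the two options concrete, here is a minimal sketch of option 2B in scikit-learn (an assumed library choice; the synthetic data and `k=5` are illustrative, not from the original post). The key point is that the selector is fitted on the training data only and then merely applied to the test data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical data standing in for the question's X_train / X_test.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the filter FS on the TRAINING data only (step 1).
selector = SelectKBest(f_classif, k=5).fit(X_train, y_train)
clf = SVC().fit(selector.transform(X_train), y_train)

# Option 2B: reuse the SAME columns chosen on the training set.
# selector.transform does not re-run FS; it just picks those columns.
y_pred = clf.predict(selector.transform(X_test))
```

Option 2A would instead call `SelectKBest(...).fit(X_test, y_test)`, which can pick different columns than the classifier was trained on.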

Best Answer

Definitely 2B. Your model was built with features A, B, D, F, L and would not be able to interpret other features appropriately. Additionally, I would recommend looking into nested cross-validation (applying FS when cross-validating).
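When cross-validating, the same logic applies inside every fold: FS must be refitted on each fold's training portion only. One way to get this automatically (a sketch, assuming scikit-learn and the same illustrative data/parameters as above) is to wrap the selector and classifier in a pipeline, so `cross_val_score` refits both on each training fold:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical data; k=5 mirrors the "top 5 features" in the question.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# FS is a pipeline step, so each CV fold refits it on that fold's
# training data only -- no information leaks from the held-out part.
pipe = make_pipeline(SelectKBest(f_classif, k=5), SVC())
scores = cross_val_score(pipe, X, y, cv=5)
```

Running FS once on the full data set and then cross-validating the classifier would leak test-fold information into the selection, which is exactly the optimism the question is trying to avoid.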