Solved – caret rfe variable selection and test prediction


I have ~800 continuous variables and a binary response (disease/non-disease), and I have been using caret to classify disease status from the continuous variables.

I used caret to split the dataset into training and test sets (2/3 and 1/3 respectively) and fit EN, RF, PLS and SVM classifiers. I get a reasonable AUC for both the training and test sets (about 0.75).

I then wanted to apply feature selection (rfe) to eliminate some low-importance / noisy variables, and I would appreciate some advice on the following points.

  1. I run rfe (e.g. rfe with rfFuncs) on the training set and then predict on the test set (a sketch of this workflow is included after this list). Is this OK, or should rfe be run on the whole dataset?
    Also, I have seen people online run rfe on a training set, create a new dataset containing only the rfe-selected variables (e.g. 100 out of 800), and then fit a classifier such as the elastic net from scratch on this smaller training set before predicting on the test set. Would that be OK, or would it lead to overfitting?

  2. rfe with rfFuncs gives me very different results depending on the seed I choose. How can I work around this?
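
To make the setup in #1 concrete, here is a minimal sketch of the workflow described, assuming a data frame `dat` holding the ~800 numeric predictors plus a two-level factor outcome `Class` (the object and column names are placeholders, not from the original post):

```r
library(caret)        # rfFuncs also requires the randomForest package

set.seed(42)
in_train <- createDataPartition(dat$Class, p = 2/3, list = FALSE)

train_x <- dat[ in_train, setdiff(names(dat), "Class")]
train_y <- dat$Class[ in_train]
test_x  <- dat[-in_train, setdiff(names(dat), "Class")]
test_y  <- dat$Class[-in_train]

## RFE with random-forest ranking functions, resampled with 10-fold CV
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

rfe_fit <- rfe(x = train_x, y = train_y,
               sizes = c(10, 25, 50, 100, 200, 400),
               rfeControl = ctrl)

## Predict the untouched test set with the final, reduced model
test_pred <- predict(rfe_fit, newdata = test_x)
confusionMatrix(test_pred$pred, test_y)
```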

Best Answer

For #1:

  • Using feature selection on the training set and predicting the test set is fine.
  • You could use the selected set with other models. However, these predictors may not work well for other models, and the performance estimates for those subsequent models will be inappropriately optimistic (since the later model training does not take into account that the features were already selected).
  • The elastic net does its own feature selection, so why would you use that model after selection?

For #2, use more resamples.
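
One way to get more resamples (the particular settings below are just an illustration, not part of the original answer) is to switch the rfeControl from the default 25 bootstrap resamples to repeated 10-fold cross-validation with many repeats, which should reduce the seed-to-seed variation in which variables are selected:

```r
## More resamples: repeated 10-fold CV (100 resamples in total)
ctrl <- rfeControl(functions = rfFuncs,
                   method    = "repeatedcv",
                   number    = 10,
                   repeats   = 10)

rfe_fit <- rfe(x = train_x, y = train_y,
               sizes = c(10, 25, 50, 100, 200, 400),
               rfeControl = ctrl)
```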

Max