I have ~800 continuous variables and a categorical response variable (disease/non-disease), and I have been using caret to classify disease from the continuous variables.
I divided my dataset into train and test sets (2/3 and 1/3 respectively) and used Elastic Net (EN), Random Forest (RF), PLS, and SVM for classification. I get a reasonable AUC (about 0.75) on both the train and test sets.
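As a minimal sketch of this setup (assuming a data frame `df` with a two-level factor outcome `disease` and the ~800 numeric predictors as the remaining columns; the object names are illustrative, not from the original post):

```r
library(caret)

# Stratified 2/3 train, 1/3 test split
set.seed(123)
in_train <- createDataPartition(df$disease, p = 2/3, list = FALSE)
train_df <- df[in_train, ]
test_df  <- df[-in_train, ]

# 10-fold CV, optimizing ROC (AUC) via twoClassSummary
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# One of the four models mentioned, e.g. elastic net via glmnet
fit_en <- train(disease ~ ., data = train_df,
                method = "glmnet",
                metric = "ROC",
                trControl = ctrl)

# Class probabilities on the held-out test set
probs <- predict(fit_en, newdata = test_df, type = "prob")
```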
I then wanted to use recursive feature elimination (rfe) to remove variables of low importance/noise, and I would like some advice on the following:
1. I run rfe (e.g. rfe with rfFuncs) on the train dataset and then predict on the test set. Is this OK, or should rfe be run on the whole dataset? I have also seen people online run rfe on a training set and then create a new dataset containing only the rfe-selected variables (e.g. 100 out of 800). They then run a classifier, e.g. Elastic Net, from scratch on this smaller dataset (training on the same train set and predicting on the test set, as before). Would that be OK, or would it result in overfitting?
2. rfe with rfFuncs gives me very variable results depending on the seed I choose. How can I work around this?
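For reference, the procedure in question #1 (rfe on the training set only, then prediction on the test set) might look like the sketch below, assuming the `train_df`/`test_df` objects from an earlier split and an outcome column `disease` (names are illustrative):

```r
library(caret)

# Resampled RFE with random-forest importance functions
rfe_ctrl <- rfeControl(functions = rfFuncs,
                       method = "cv",
                       number = 10)

set.seed(123)
rfe_fit <- rfe(x = train_df[, setdiff(names(train_df), "disease")],
               y = train_df$disease,
               sizes = c(25, 50, 100, 200, 400),  # candidate subset sizes
               rfeControl = rfe_ctrl)

predictors(rfe_fit)                # variables in the selected subset
pred <- predict(rfe_fit, test_df)  # predictions on the held-out test set
```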
Best Answer
For #1:
For #2, use more resamples.
Max