Solved – a way to use cross-validation for variable/feature selection in R

cross-validation, feature-selection, r, random-forest, stepwise-regression

I have a data set with about 70 variables that I'd like to cut down. What I'm looking to do is use CV to find the most useful variables, in the following fashion:

1) Randomly select say 20 variables.

2) Use stepwise/LASSO/lars/etc to choose most important variables.

3) Repeat ~50x and see which variables are selected (not eliminated) most frequently.
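The three steps above can be sketched roughly as follows, using `glmnet` for the LASSO step (one of the options the question mentions; `lars` or stepwise would slot into the same place). The data here are a toy stand-in, not from the question:

```r
# Sketch of the proposed procedure: repeatedly sample 20 variables,
# run a cross-validated LASSO, and tally which variables survive.
library(glmnet)

set.seed(1)
n <- 200; p <- 70                       # toy data: 70 predictors, continuous y
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("V", 1:p)))
y <- X[, 1] + 2 * X[, 2] + rnorm(n)

counts <- setNames(integer(p), colnames(X))
for (i in 1:50) {
  vars  <- sample(colnames(X), 20)              # 1) random subset of 20
  fit   <- cv.glmnet(X[, vars], y)              # 2) LASSO, lambda by CV
  coefs <- coef(fit, s = "lambda.1se")[-1, 1]   #    drop the intercept
  kept  <- names(coefs)[coefs != 0]             #    variables not eliminated
  counts[kept] <- counts[kept] + 1              # 3) tally selections
}
head(sort(counts, decreasing = TRUE))           # most frequently selected
```

With informative predictors (like V1 and V2 above) you'd expect their counts to dominate the tally.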

This is along the lines of what a randomForest would do, but the rfVarSel package seems to only work for factors/classification and I need to predict a continuous dependent variable.

I'm using R so any suggestions would ideally be implemented there.

Best Answer

I believe what you describe is already implemented in the caret package. Look at the rfe function or the vignette here: http://cran.r-project.org/web/packages/caret/vignettes/caretSelection.pdf
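Since the question involves a continuous dependent variable, it may help to note that `rfe` handles regression out of the box when `y` is numeric. A minimal sketch with toy placeholder data (the names `X`, `y`, and the subset sizes are illustrative, not from the question):

```r
# Minimal caret::rfe sketch for a continuous response, using the
# random-forest helper functions shipped with caret (rfFuncs).
library(caret)

set.seed(1)
X <- data.frame(matrix(rnorm(100 * 70), 100, 70))   # toy predictors
y <- rnorm(100)                                     # toy continuous response

ctrl <- rfeControl(functions = rfFuncs,   # random-forest based ranking
                   method = "cv",         # 10-fold cross-validation
                   number = 10)
res <- rfe(X, y, sizes = c(5, 10, 20, 30), rfeControl = ctrl)
predictors(res)   # variables retained at the selected subset size
```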

Now, having said that, why do you need to reduce the number of features? Going from 70 to 20 isn't really an order-of-magnitude decrease. I would think you'd need more than 70 features before you would have a firm prior belief that some of the features really and truly don't matter. But then again, that's where a subjective prior comes in, I suppose.
