Solved – Feature selection for random forest using rfcv in the R randomForest package

r, random forest

I am running a randomForest regression. I have 70 observations and 122 independent variables. To select variables for the final model, I combined the rfcv() function with the importance() function from the randomForest package. For example, if the model with six variables has the lowest cross-validated error according to rfcv(), then the final model will use the six most important variables according to importance(), as sketched below.
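
For concreteness, here is a minimal sketch of that workflow. The data frame `dat` and the response column name `y` are hypothetical placeholders, not part of the original question.

```r
## A minimal sketch of the approach described above, assuming a data frame
## `dat` whose response column is named `y` (both names are hypothetical).
library(randomForest)

set.seed(42)
x <- dat[, setdiff(names(dat), "y")]
y <- dat$y

## Cross-validated error for a sequence of variable-subset sizes
cv <- rfcv(trainx = x, trainy = y, cv.fold = 5)
n_opt <- cv$n.var[which.min(cv$error.cv)]   # e.g. 6 in the question

## Rank all 122 variables by permutation importance from a full forest
rf  <- randomForest(x, y, importance = TRUE)
imp <- importance(rf, type = 1)             # type = 1 is %IncMSE for regression
top_vars <- names(sort(imp[, 1], decreasing = TRUE))[seq_len(n_opt)]

## Refit the final model on the selected variables only
rf_final <- randomForest(x[, top_vars], y)
```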

I wonder whether my approach is appropriate or not.

Best Answer

Using the cross-validation error from randomForest's rfcv() is a valid approach to feature selection, provided you've chosen sensible settings for the sampling and folds.

However, with only 70 observations you're not going to get great results. For one thing, it's really hard to generate enough "folds" to get a meaningful validation number while still having enough data left to build the forest.
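
One partial mitigation is to repeat the cross-validation several times and average the error curves, so the choice of subset size doesn't hinge on one lucky fold split. A sketch, with `x` and `y` as in the example above:

```r
## Repeat rfcv() and average the cross-validated error per subset size,
## to reduce the noise in the fold estimates with only 70 rows.
set.seed(1)
reps <- replicate(10, rfcv(x, y, cv.fold = 5)$error.cv)
err_mean <- rowMeans(reps)                       # mean error per subset size
n_opt <- as.numeric(names(which.min(err_mean)))  # more stable subset size
```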

Random forests work best when there are many more than 70 rows to sample from. I don't know how many trees you're using (or how deep they grow), but with so few rows the bootstrap samples will quickly run out of unique draws at each split point, and you'll see suboptimal performance.
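
If you do stay with a forest, these are the randomForest knobs that control how much data each tree sees and how deep it grows. The values below are purely illustrative, not recommendations:

```r
## Illustrative only: controlling per-tree sample size and tree depth.
rf_small <- randomForest(
  x, y,
  ntree    = 2000, # many trees cost little and stabilize error estimates
  sampsize = 50,   # rows drawn for each tree
  nodesize = 5,    # minimum terminal-node size (5 is the regression default)
  maxnodes = 16    # caps tree size, limiting overfitting on tiny samples
)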

The real challenge in your situation is how to do variable selection with so few input rows. A traditional approach to this kind of problem is penalized linear regression, such as ridge or the lasso; the lasso in particular performs variable selection directly, by shrinking coefficients exactly to zero, and these methods tend to hold up well when there is very little data.
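
A sketch of that alternative using the glmnet package (not part of the original question). Note that glmnet expects a numeric matrix, so any factor columns in `x` would need encoding first:

```r
## Lasso variable selection with cross-validated lambda, via glmnet.
library(glmnet)

set.seed(1)
fit <- cv.glmnet(as.matrix(x), y, alpha = 1)  # alpha = 1 is the lasso
coefs <- coef(fit, s = "lambda.min")          # coefficients at the best lambda
nz <- coefs[, 1]
selected <- setdiff(names(nz[nz != 0]), "(Intercept)")  # nonzero variables
```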

All this being said, given the very small amount of data you are working with, it's going to be really hard to get a good model even if you use the most sophisticated techniques.