Solved – Compare R-squared from two different Random Forest models

hypothesis-testing, machine-learning, model-selection, r, random-forest

I'm using the randomForest package in R to build a random forest model that explains a continuous outcome in a "wide" dataset with more predictors than samples.

Specifically, I'm fitting one RF model allowing the procedure to select from a set of ~75 predictor variables that I think are important.

I'm testing how well that model predicts the actual outcome for a reserved testing set, using the approach posted here previously, namely

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y}_i)^2}$$

or in R:

1 - sum((y - predicted)^2) / sum((y - mean(y))^2)
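
For concreteness, here is a minimal sketch of that workflow (train_dat and test_dat are hypothetical names for the training data and the reserved testing set, each containing the outcome column y):

library(randomForest)

# Fit on the training set, predict on the held-out test set
fit  <- randomForest(y ~ ., data = train_dat)
pred <- predict(fit, newdata = test_dat)

# Test-set R^2, computed exactly as above
y_test <- test_dat$y
1 - sum((y_test - pred)^2) / sum((y_test - mean(y_test))^2)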

But now I have an additional ~25 predictor variables that I can add. When the model uses the full set of ~100 predictors, the test-set R² is higher. I want to test that difference statistically: does the model fit with ~100 predictors predict the testing data significantly better than the model fit with ~75 predictors? In other words, is the test-set R² of the RF model fit on the full predictor set significantly higher than the test-set R² of the model fit on the reduced set?

This is important to test because this is pilot data: obtaining those extra 25 predictors was expensive, and I need to know whether I should pay to measure them in a larger follow-up study.

I'm trying to think of some kind of resampling/permutation approach, but nothing comes to mind.

Best Answer

Cross-validate! Use the train function in caret to fit your two models, using a single value of mtry (the same for both models). caret will return a resampled estimate of RMSE and $R^2$ for each model.

See page 3 of the caret vignette (also in the full reference manual)
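
A sketch of that setup (x_reduced, x_full, and y are placeholder objects for the ~75-predictor matrix, the ~100-predictor matrix, and the outcome vector; the mtry value is arbitrary). Setting the same seed before each call to train gives both models identical resampling folds, so the comparison is paired:

library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
grid <- data.frame(mtry = 25)  # one fixed mtry, the same for both models

set.seed(1)
fit_reduced <- train(x = x_reduced, y = y, method = "rf",
                     tuneGrid = grid, trControl = ctrl)

set.seed(1)
fit_full <- train(x = x_full, y = y, method = "rf",
                  tuneGrid = grid, trControl = ctrl)

# Collect the matched resampling results and compare the two models
rs <- resamples(list(reduced = fit_reduced, full = fit_full))
summary(rs)        # resampled RMSE and R^2 for each model
summary(diff(rs))  # paired differences across resamples, with p-values

The summary of diff(rs) reports paired differences in RMSE and $R^2$ across the shared resamples, which speaks directly to whether the extra ~25 predictors buy a significant improvement.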
