Solved – Compare R-squared from two different Random Forest models

hypothesis-testing, machine-learning, model-selection, r, random-forest

I'm using the randomForest package in R to build a random forest model that explains a continuous outcome in a "wide" dataset with more predictors than samples.

Specifically, I'm fitting one RF model allowing the procedure to select from a set of ~75 predictor variables that I think are important.

I'm testing how well that model predicts the actual outcome for a reserved testing set, using the approach posted here previously, namely

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y}_i)^2}$$

or in R:

1 - sum((y - predicted)^2) / sum((y - mean(y))^2)
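
For concreteness, here is a minimal sketch of that workflow (train_dat and test_dat are hypothetical names for the training data and the reserved testing set, each containing the outcome column y):

library(randomForest)

# Fit on the training set, predict on the held-out test set
fit  <- randomForest(y ~ ., data = train_dat)
pred <- predict(fit, newdata = test_dat)

# Test-set R^2, computed exactly as above
y_test <- test_dat$y
1 - sum((y_test - pred)^2) / sum((y_test - mean(y_test))^2)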

But now I have an additional ~25 predictor variables that I can add. When the model uses the full set of ~100 predictors, the test-set R² is higher. I want to test that difference statistically: does the model fit with ~100 predictors predict the testing data significantly better than the model fit with ~75 predictors? In other words, is the test-set R² of the RF model fit on the full predictor set significantly higher than the test-set R² of the model fit on the reduced set?

This is important to test because this is pilot data: obtaining those extra 25 predictors was expensive, and I need to know whether I should pay to measure them in a larger follow-up study.

I'm trying to think of some kind of resampling/permutation approach, but nothing comes to mind.

Best Answer

Cross-validate! Use the train function in caret to fit your two models, using a single value of mtry (the same for both models). caret will return a resampled estimate of RMSE and $R^2$ for each model.

See page 3 of the caret vignette (also in the full reference manual)
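
A sketch of that setup (x_reduced, x_full, and y are placeholder objects for the ~75-predictor matrix, the ~100-predictor matrix, and the outcome vector; the mtry value is arbitrary). Setting the same seed before each call to train gives both models identical resampling folds, so the comparison is paired:

library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
grid <- data.frame(mtry = 25)  # one fixed mtry, the same for both models

set.seed(1)
fit_reduced <- train(x = x_reduced, y = y, method = "rf",
                     tuneGrid = grid, trControl = ctrl)

set.seed(1)
fit_full <- train(x = x_full, y = y, method = "rf",
                  tuneGrid = grid, trControl = ctrl)

# Collect the matched resampling results and compare the two models
rs <- resamples(list(reduced = fit_reduced, full = fit_full))
summary(rs)        # resampled RMSE and R^2 for each model
summary(diff(rs))  # paired differences across resamples, with p-values

The summary of diff(rs) reports paired differences in RMSE and $R^2$ across the shared resamples, which speaks directly to whether the extra ~25 predictors buy a significant improvement.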
