Solved – Random Forest % Var explained OOB output differs from test dataset results

Tags: model selection, random forest, regression

I am learning how to use random forests in R for regression, using the Boston dataset. I am unsure which values I should focus on when evaluating the fitted model: the OOB % Var explained and MSE printed with the model output, or the results I obtain when applying the random forest to a validation set.

As a first step, I split the Boston dataset into a training and a validation set:

require(randomForest)
require(MASS)    # provides the Boston data set
attach(Boston)   # not strictly needed, since data = is passed explicitly below

set.seed(100)
# 70/30 split into training and validation rows
train <- sample(nrow(Boston), 0.7 * nrow(Boston), replace = FALSE)
TrainSet <- Boston[train, ]
ValidSet <- Boston[-train, ]

Then I fit a random forest on the training set:

set.seed(100)
Boston.rf <- randomForest(medv ~ ., mtry=6, data = TrainSet, importance = TRUE)
Boston.rf

Call:
 randomForest(formula = medv ~ ., data = TrainSet, mtry = 6, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 6

          Mean of squared residuals: 11.23613
                    % Var explained: 86.98
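
If I understand the randomForest object correctly, both of these numbers come from the out-of-bag predictions, which for a regression forest are stored in the predicted component. As a rough check (this is my reading of the package, not something taken from the output above), something like this should approximately reproduce them:

# Reproduce the OOB numbers by hand: for a regression forest,
# Boston.rf$predicted holds the out-of-bag predictions for the training rows.
oob.res <- TrainSet$medv - Boston.rf$predicted
mean(oob.res^2)                                                       # OOB MSE, "Mean of squared residuals"
1 - mean(oob.res^2) / mean((TrainSet$medv - mean(TrainSet$medv))^2)   # "% Var explained" / 100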

Next, I use the fitted model to predict medv in the independent validation set and use the results to obtain the R-squared, MSE, and RMSE on the validation set:

predvalidSet <- predict(Boston.rf, ValidSet)
# merge data for regression
totaltest <- cbind(ValidSet, predvalidSet)

reg <- lm(medv ~ predvalidSet, data = totaltest)
summary(reg)

Call:
lm(formula = medv ~ predvalidSet, data = totaltest)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.7753  -1.1570   0.1062   1.7037   7.2186 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.87664    0.69482  -2.701  0.00771 ** 
predvalidSet  1.06294    0.02916  36.448  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.861 on 150 degrees of freedom
Multiple R-squared:  0.8985,    Adjusted R-squared:  0.8979 
F-statistic:  1328 on 1 and 150 DF,  p-value: < 2.2e-16

# Mean squared error on the validation set
RSS <- sum(reg$residuals^2)           # residual sum of squares of the regression
MSE <- RSS / length(reg$residuals)
MSE
# ~8.08 (= RMSE.valid^2 below)

# root mean squared error
RMSE.valid <- sqrt(mean(reg$residuals^2))
RMSE.valid
# 2.841671

The R-squared is higher, and the MSE and RMSE are lower, on the validation set than in the random forest output itself (the OOB % Var explained and mean of squared residuals).

In general, which values should I use to evaluate the model and its predictive ability?
I tend towards using the values obtained from the validation set.

Thanks in advance!

Best Answer

Your way of calculating the accuracy on the validation set is not entirely appropriate, because your linear regression estimates an intercept (not forced to 0) and a slope (not forced to 1); its R-squared therefore measures how well a rescaled version of the predictions tracks the observations, not how close the predictions themselves are. The way to go is to calculate the out-of-sample residuals as the difference between the observed values and the random forest predictions, and then compute the metrics of interest by hand. Put differently, just skip the part involving the linear regression.

So, following your code, you can obtain the "correct" validation results with something like:

res <- ValidSet$medv - predvalidSet

# RMSE
sqrt(mean(res^2)) # 2.922392

# R-squared
1 - var(res) / var(ValidSet$medv) # 0.8953902
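
If you also want a validation MSE that is directly comparable to the forest's "Mean of squared residuals", the same residuals give it:

# MSE on the validation set, comparable to the OOB mean of squared residuals
mean(res^2)  # roughly 8.54, i.e. the RMSE above squared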

In this case, the out-of-bag R-squared (0.8698) is a bit lower than the one evaluated on the validation data (0.8954). This is not a big deal, as the Boston data set is small and a single random split leaves a fair amount of noise in both numbers.
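
If you want a feel for how much these numbers move around on a data set this small, one rough check (just a suggestion: repeating the same split and fit over a few seeds) is something like:

# Repeat the 70/30 split and the forest fit with different seeds and look at
# the spread of the validation R-squared (randomForest and MASS loaded above).
for (s in 1:5) {
  set.seed(s)
  tr  <- sample(nrow(Boston), floor(0.7 * nrow(Boston)))
  fit <- randomForest(medv ~ ., mtry = 6, data = Boston[tr, ])
  r   <- Boston$medv[-tr] - predict(fit, Boston[-tr, ])
  cat("seed", s, "validation R-squared:",
      round(1 - var(r) / var(Boston$medv[-tr]), 3), "\n")
}

If the split-to-split scatter is comparable to the OOB-versus-validation gap you saw, there is little reason to prefer one number over the other for this data set.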