I am learning how to use random forest in R for regression, based on the Boston dataset. I am unsure which values I should concentrate on to evaluate the obtained model: the OOB % Var explained and MSE from the model output, or the results I obtain by applying the random forest model to a validation set.
In a first step, I split the Boston dataset into a training and a validation set:
require(randomForest)
require(MASS)
attach(Boston)
set.seed(100)
train <- sample(nrow(Boston), 0.7*nrow(Boston), replace = FALSE)
TrainSet <- Boston[train,]
ValidSet <- Boston[-train,]
Then I fit a random forest on the TrainSet:
set.seed(100)
Boston.rf <- randomForest(medv ~ ., mtry=6, data = TrainSet, importance = TRUE)
Boston.rf
Call:
randomForest(formula = medv ~ ., data = TrainSet, mtry = 6, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 6
Mean of squared residuals: 11.23613
% Var explained: 86.98
In a next step I use the obtained model to predict the response in the independent validation set, and use the results to obtain the R-squared, MSE, and RMSE on the validation set:
predvalidSet <- predict(Boston.rf,ValidSet)
# merge data for regression
totaltest <- cbind(ValidSet,predvalidSet)
reg <- lm(medv ~ predvalidSet, data = totaltest)
summary(reg)
Call:
lm(formula = medv ~ predvalidSet, data = totaltest)
Residuals:
Min 1Q Median 3Q Max
-16.7753 -1.1570 0.1062 1.7037 7.2186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.87664 0.69482 -2.701 0.00771 **
predvalidSet 1.06294 0.02916 36.448 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.861 on 150 degrees of freedom
Multiple R-squared: 0.8985, Adjusted R-squared: 0.8979
F-statistic: 1328 on 1 and 150 DF, p-value: < 2.2e-16
# Mean squared error on the validation set
RSS <- sum(reg$residuals^2)
MSE <- RSS / length(reg$residuals)
MSE
# 7.085578
# Root mean squared error
RMSE.valid<- sqrt(mean(reg$residuals^2))
RMSE.valid
# 2.841671
The R-squared value is higher, and the MSE and RMSE are lower, on the validation set than in the direct output of the random forest model (the OOB % Var explained and MSE).
In general, which values should I choose to evaluate the model and its predictive ability?
I tend towards using the values obtained from the validation set.
Thanks in advance!
Best Answer
Your way of calculating the accuracy on the validation data set is not entirely appropriate, because your linear regression estimates an intercept (not constrained to 0) and a slope (not constrained to 1), which flatters the fit. The way to go is to calculate the out-of-sample residuals directly as the difference between the observed values and the values predicted by the random forest, and then compute the metrics of interest by hand. Put differently, just skip the part involving a linear regression.
So, following your code, you can obtain the "correct" validation results by something like the following.
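Here is a sketch, reusing the question's variable names (`TrainSet`, `ValidSet`, `Boston.rf`): the residuals are formed as observed minus predicted, and the MSE, RMSE, and R-squared are then computed by hand.

```r
library(randomForest)
library(MASS)

# Same split and model as in the question
set.seed(100)
train <- sample(nrow(Boston), 0.7 * nrow(Boston), replace = FALSE)
TrainSet <- Boston[train, ]
ValidSet <- Boston[-train, ]

set.seed(100)
Boston.rf <- randomForest(medv ~ ., mtry = 6, data = TrainSet,
                          importance = TRUE)

# Out-of-sample residuals: observed minus predicted
predvalidSet <- predict(Boston.rf, ValidSet)
resid.valid  <- ValidSet$medv - predvalidSet

# Metrics computed by hand, without a linear regression
MSE.valid  <- mean(resid.valid^2)
RMSE.valid <- sqrt(MSE.valid)
R2.valid   <- 1 - MSE.valid /
  mean((ValidSet$medv - mean(ValidSet$medv))^2)
```

Note that `R2.valid` computed this way is 1 minus the ratio of the prediction MSE to the variance of the observed response, so it can even be negative for a model that predicts worse than the mean; unlike the R-squared of the auxiliary regression, it does not "forgive" a biased intercept or slope.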
In this case, the out-of-bag R-squared is a bit lower than the one evaluated on the validation data. This is not a big deal, as the Boston data set is small.
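If you want to make that comparison explicit, the OOB metrics can also be recomputed by hand: for a `randomForest` regression object, calling `predict()` without `newdata` returns the out-of-bag predictions (also stored in `Boston.rf$predicted`). A sketch, assuming the objects from the question:

```r
# OOB predictions on the training data
oob.pred <- predict(Boston.rf)   # no newdata => out-of-bag predictions

# OOB MSE and R-squared, on the same scale as the printed model output
oob.mse <- mean((TrainSet$medv - oob.pred)^2)
oob.r2  <- 1 - oob.mse /
  mean((TrainSet$medv - mean(TrainSet$medv))^2)
```

These should match the "Mean of squared residuals" and "% Var explained" lines printed by the model, and are directly comparable to the validation-set metrics computed the same way.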