Solved – In the R randomForest package for random forest feature selection, how is the dataset split for training and testing

Tags: accuracy, importance, r, random forest

I'm using the randomForest R package to perform random forest feature selection. I understand that, after running the randomForest function, I have to check the importance field and study the importance measured through the mean decrease in accuracy and the mean decrease in Gini impurity.

For example, by using this R code:

data(iris)
library("randomForest")
set.seed(71)
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE,
                        proximity=TRUE)
print(iris.rf)       # OOB error estimate and confusion matrix
iris.rf$importance   # matrix of per-variable importance measures

The last two columns of the importance matrix look like this:

                     MeanDecreaseAccuracy  MeanDecreaseGini
        Sepal.Width  0.007962441             2.625413
        Sepal.Length 0.031901722            10.714741
        Petal.Length 0.304760304            42.104241
        Petal.Width  0.300907912            43.767952

Then I see that I can rank these features by the MeanDecreaseAccuracy or MeanDecreaseGini column to find the most important ones.
Although I understand the output of the method, I cannot understand how the method obtains these results.

The questions are:

  1. How does the method compute the accuracy?
  2. How does the method split the dataset into training set and test set to compute the accuracy?

Best Answer

It does not use a separate training and testing set. Instead, standard accuracy estimation in random forests takes advantage of an important feature: bagging, or bootstrap aggregation.

To construct a random forest, a large number of bootstrap samples are drawn by sampling with replacement from the full dataset, and a separate decision tree is fit to each one; together the trees form the random forest. Each data point from the full dataset appears in approximately 2/3 of the bootstrap samples and is absent from the remaining 1/3. You can therefore use the roughly 1/3 of trees that were not trained on a given point to predict its value; these are called out-of-bag (OOB) predictions. Because the point was never seen by the trees predicting it, this avoids the overfitting problem (and arguably makes cross-validation redundant for this purpose). Repeating this for every point in the full dataset and comparing the OOB predictions against the true values gives the accuracy of the random forest.
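
As a minimal sketch of this, assuming the iris.rf object fitted in the question, the OOB predictions are stored in the fitted object and can be compared against the true labels directly:

# Each observation in iris.rf$predicted was predicted only by the trees
# whose bootstrap sample did not include that observation (out-of-bag).
oob_pred <- iris.rf$predicted
oob_accuracy <- mean(oob_pred == iris$Species)
oob_accuracy   # equals 1 minus the OOB error rate reported by print(iris.rf)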

The mean decrease in accuracy metric (generally recommended) for a variable is calculated by randomly permuting the values of that variable in the out-of-bag samples and measuring how much the prediction accuracy of the forest drops; the drop is averaged over all trees.
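
To illustrate the idea only (the randomForest package does this per tree, on each tree's out-of-bag sample, while this sketch uses the full training data), the permutation step for a single variable could look like this, reusing iris.rf from the question:

# Shuffle one predictor to break its association with the response,
# then compare prediction accuracy before and after the shuffle.
set.seed(42)
shuffled <- iris
shuffled$Petal.Length <- sample(shuffled$Petal.Length)

baseline <- mean(predict(iris.rf, newdata = iris)     == iris$Species)
permuted <- mean(predict(iris.rf, newdata = shuffled) == iris$Species)
baseline - permuted   # a large drop suggests Petal.Length is important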

The mean decrease in Gini metric is explained this way by Breiman & Cutler (which I took from this helpful answer):

Every time a split of a node is made on variable m the gini impurity criterion for the two descendent nodes is less than the parent node. Adding up the gini decreases for each individual variable over all trees in the forest gives a fast variable importance that is often very consistent with the permutation importance measure.
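
For intuition, here is a small sketch of the quantity being summed; the gini() helper and the Petal.Length < 2.5 threshold are illustrative choices, not taken from the package:

# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Weighted decrease in Gini impurity for one candidate split
left  <- iris$Species[iris$Petal.Length <  2.5]
right <- iris$Species[iris$Petal.Length >= 2.5]
n <- nrow(iris)
gini(iris$Species) -
  (length(left) / n * gini(left) + length(right) / n * gini(right))

Summing these decreases over every split made on a given variable and averaging over all trees gives the MeanDecreaseGini value reported above.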