Solved – In the R randomForest package for random forest feature selection, how is the dataset split for training and testing

Tags: accuracy, importance, r, random forest

I'm using the randomForest R package to perform random forest feature selection. I understand that, after running the randomForest function, I have to check the importance field and study the importance measured through the mean decrease in accuracy and the mean decrease in Gini impurity.

For example, by using this R code:

data(iris)
library("randomForest")
set.seed(71)
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE,
                        proximity=TRUE)
print(iris.rf)       # OOB error estimate and confusion matrix
iris.rf$importance   # matrix of per-variable importance measures

The last two columns of the importance matrix look like this:

                     MeanDecreaseAccuracy  MeanDecreaseGini
        Sepal.Width  0.007962441             2.625413
        Sepal.Length 0.031901722            10.714741
        Petal.Length 0.304760304            42.104241
        Petal.Width  0.300907912            43.767952

Then I see that I can rank these features by the MeanDecreaseAccuracy or MeanDecreaseGini column to find the most important ones.
Although I understand the output of the method, I cannot understand how the method obtains these results.

The questions are:

  1. How does the method compute the accuracy?
  2. How does the method split the dataset into training set and test set to compute the accuracy?

Best Answer

It does not use a separate training and testing set. Instead, standard accuracy estimation in random forests takes advantage of an important feature: bagging, or bootstrap aggregation.

To construct a random forest, a large number of bootstrap samples are drawn by sampling with replacement from the full dataset, and a separate decision tree is fit to each one; together the trees form the random forest. Each data point from the full dataset appears in approximately 2/3 of the bootstrap samples and is absent from the remaining 1/3. You can therefore use the roughly 1/3 of trees that were not trained on a given point to predict its value; these are called out-of-bag (OOB) predictions. Because the point was never seen by the trees predicting it, this avoids the overfitting problem (and arguably makes cross-validation redundant for this purpose). Repeating this for every point in the full dataset and comparing the OOB predictions against the true values gives the accuracy of the random forest.
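
As a minimal sketch of this, assuming the iris.rf object fitted in the question, the OOB predictions are stored in the fitted object and can be compared against the true labels directly:

# Each observation in iris.rf$predicted was predicted only by the trees
# whose bootstrap sample did not include that observation (out-of-bag).
oob_pred <- iris.rf$predicted
oob_accuracy <- mean(oob_pred == iris$Species)
oob_accuracy   # equals 1 minus the OOB error rate reported by print(iris.rf)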

The mean decrease in accuracy metric (generally recommended) for a variable is calculated by randomly permuting the values of that variable in the out-of-bag samples and measuring how much the prediction accuracy of the forest drops; the drop is averaged over all trees.
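
To illustrate the idea only (the randomForest package does this per tree, on each tree's out-of-bag sample, while this sketch uses the full training data), the permutation step for a single variable could look like this, reusing iris.rf from the question:

# Shuffle one predictor to break its association with the response,
# then compare prediction accuracy before and after the shuffle.
set.seed(42)
shuffled <- iris
shuffled$Petal.Length <- sample(shuffled$Petal.Length)

baseline <- mean(predict(iris.rf, newdata = iris)     == iris$Species)
permuted <- mean(predict(iris.rf, newdata = shuffled) == iris$Species)
baseline - permuted   # a large drop suggests Petal.Length is important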

The mean decrease in Gini metric is explained this way by Breiman & Cutler (which I took from this helpful answer):

Every time a split of a node is made on variable m the gini impurity criterion for the two descendent nodes is less than the parent node. Adding up the gini decreases for each individual variable over all trees in the forest gives a fast variable importance that is often very consistent with the permutation importance measure.
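
For intuition, here is a small sketch of the quantity being summed; the gini() helper and the Petal.Length < 2.5 threshold are illustrative choices, not taken from the package:

# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Weighted decrease in Gini impurity for one candidate split
left  <- iris$Species[iris$Petal.Length <  2.5]
right <- iris$Species[iris$Petal.Length >= 2.5]
n <- nrow(iris)
gini(iris$Species) -
  (length(left) / n * gini(left) + length(right) / n * gini(right))

Summing these decreases over every split made on a given variable and averaging over all trees gives the MeanDecreaseGini value reported above.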