Solved – Random Forest plot Interpretation in R

Tags: data visualization, machine learning, r, random forest

I am analyzing data (which I am unable to share) and have built several classification models over four classes using the randomForest() function. They are fairly successful: in this example, when evaluated on the test set, the overall accuracy is above 0.88, with each class having an accuracy above 0.86.

When I call plot() on these models, I always get graphs similar to the one pictured below: at some point in the graph there always seems to be an error rate of 1.

I thought this could be an accuracy rate instead; after all, the model has an accuracy of 0.95 for 'FVP'. But that would imply 'Normal' has an accuracy of about 0.35, which is not even close.

How do I interpret this graph? If the code for this plot function is bugged, what can I use to visualize anything about the randomForest() object?

Best Answer

When the model is fitted without the xtest and ytest arguments, this plot shows the out-of-bag (OOB) error rates, which can differ dramatically from error rates measured on a genuine test set.
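To see the two kinds of error side by side, here is a minimal sketch. Since the original data cannot be shared, it assumes the built-in iris data as a stand-in, and the train/test split and ntree value are arbitrary choices. The fitted object stores the cumulative OOB error per tree in err.rate (one column for the overall OOB error, then one per class); when xtest and ytest are supplied, the test-set error is stored in test$err.rate. With only a few trees each sample has been out-of-bag only a handful of times, so the per-class curves tend to be very noisy at the left edge of the plot.

```r
library(randomForest)

set.seed(42)
# Assumption: iris stands in for the asker's (unshareable) four-class data.
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

rf <- randomForest(x = train[, 1:4], y = train$Species,
                   xtest = test[, 1:4], ytest = test$Species,
                   ntree = 200)

# Cumulative OOB error: first column is overall, remaining columns are per class.
oob_err  <- rf$err.rate
# Cumulative test-set error, available because xtest/ytest were given.
test_err <- rf$test$err.rate

# Plot the OOB curves with a legend so each line can be matched to a class.
matplot(oob_err, type = "l", lty = 1, col = seq_len(ncol(oob_err)),
        xlab = "trees", ylab = "error rate")
# Overlay the overall test-set error as a dashed line for comparison.
lines(test_err[, 1], lty = 2, lwd = 2)
legend("topright", legend = c(colnames(oob_err), "test (overall)"),
       col = c(seq_len(ncol(oob_err)), 1),
       lty = c(rep(1, ncol(oob_err)), 2))
```

Reading the last row of each matrix gives the error for the full forest, which is the number comparable to a reported accuracy.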

Related Solutions

Solved – Random Forest partial plot

Something like that would be my starting assumption, and for many practical examples you would be unlucky if it turned out to be very wrong. But...

Noise: The more noise, the more conservative the predictions of the RF will be (regression towards the mean). This introduces a bias, generally reducing the amplitude/steepness of a given partial plot. This should be regarded as a feature, not a bug. Thus the upper flatness can also be due to few samples and more noise.

Interactions: Partial plotting of the higher-dimensional topology of the trained RF model is suitable only when there are no dominant interactions with this specific variable. In the extreme case a variable can be highly important but have a nearly flat partial function, or you could end up with a Simpson's paradox http://en.wikipedia.org/wiki/Simpson%27s_paradox.
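The extreme case can be reproduced with simulated data. The sketch below is an illustration under assumed conditions, not taken from the question: the response is y = x1 * sign(x2), so the effect of x1 flips sign depending on x2. Averaged over x2, the partial function of x1 should come out nearly flat, while the permutation importance of x1 should still be clearly positive.

```r
library(randomForest)

set.seed(1)
n  <- 1000
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)
# Assumed toy model: a pure interaction, no main effect of x1.
d  <- data.frame(x1 = x1, x2 = x2, y = x1 * sign(x2))

rf <- randomForest(y ~ x1 + x2, data = d, ntree = 200, importance = TRUE)

# plot = FALSE returns the grid (x) and the partial function (y) without drawing.
pp <- partialPlot(rf, pred.data = d, x.var = "x1", plot = FALSE)

# Vertical range of the partial function, to compare with the range of y (about 2).
pd_range <- diff(range(pp$y))
print(pd_range)

# Permutation importance (%IncMSE) and node-purity importance for both variables.
print(importance(rf))
```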

Sample density: Alternatively you could more crudely say overall that y = a log(x) + b. I would recommend plotting an overlay of the training samples. Otherwise it is hard to assess whether a given local 'blip' is most likely due to few samples and some noise, or is actually a sound trend that deserves to be described in detail.

Did the model use the specific variable much?: If the variable importance of this variable is very low, that often means the variable has not been used much in the trees of the forest. The partial function then becomes less reproducible and more crude. This can happen with noisy or sparse data. It helps a little to lower mtry, so that weaker variables are used more often.

Lastly, here is a link to a similar question I answered with some code examples for R randomForest: R: What do I see in partial dependence plots of gbm and RandomForest?
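One way to draw such an overlay is sketched below. It assumes a regression forest on the built-in mtcars data (a stand-in, not the data from the question) and uses partialPlot() with plot = FALSE to get the curve, then draws it on top of the raw training points. The rug along the axis shows where samples are sparse, which is where bumps in the curve deserve the least trust.

```r
library(randomForest)

set.seed(1)
# Assumption: mtcars and the variable "hp" are used purely for illustration.
rf <- randomForest(mpg ~ ., data = mtcars, ntree = 300)

# Returns list(x = grid values, y = partial function) without plotting.
pp <- partialPlot(rf, pred.data = mtcars, x.var = "hp", plot = FALSE)

# Training samples first, then the partial function on top.
plot(mtcars$hp, mtcars$mpg, pch = 16, col = "grey60",
     xlab = "hp", ylab = "mpg")
lines(pp$x, pp$y, lwd = 2)
# One tick per training sample, to show local sample density.
rug(mtcars$hp)
```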
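How often each variable was actually split on can be checked with varUsed(). The sketch below, again assuming mtcars as stand-in data, compares the split counts under the default mtry with those under mtry = 1. The exact counts depend on the seed, so none are quoted here.

```r
library(randomForest)

set.seed(1)
# Assumption: mtcars is illustrative only; mtry = 1 is the lowest possible setting.
rf_default <- randomForest(mpg ~ ., data = mtcars, ntree = 300, importance = TRUE)
rf_low     <- randomForest(mpg ~ ., data = mtcars, ntree = 300, mtry = 1,
                           importance = TRUE)

# varUsed() counts how many times each predictor was used for a split.
used <- cbind(default_mtry = varUsed(rf_default),
              mtry_1       = varUsed(rf_low))
rownames(used) <- rownames(importance(rf_default))
print(used)
```

Variables with small counts in the first column are the ones whose partial plots are likely to be crude.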

Random Forest – How to Interpret Variable Importance Plot in Random Forests Using R

OK, so the first plot does not reflect the % drop in accuracy but rather the mean change in accuracy scaled by its standard deviation. The unscaled change in accuracy is stored here. Note that MeanDecreaseAccuracy is the overall figure across both classes; in this output it sits between columns 1 and 2, closer to class 0, rather than at their simple mean:

wine.bag$importance
                             0          1 MeanDecreaseAccuracy MeanDecreaseGini
alcohol             0.04666892 0.22738424           0.08223163         352.1256
volatile_acidity    0.02050844 0.11063939           0.03823661         195.8936
sulphates           0.01447296 0.07839553           0.02705122         182.4080
residual_sugar      0.02873093 0.08038513           0.03888946         187.5240
chlorides           0.01957198 0.11556222           0.03845305         197.1288

When you scale it by SD, you get the numbers you see in the plot:

wine.bag$importance[,1:3]/wine.bag$importanceSD[,1:3]
                           0        1 MeanDecreaseAccuracy
alcohol             61.36757 83.93440            107.08224
volatile_acidity    48.13822 75.60551             83.95987
sulphates           43.27217 66.92138             73.31890
residual_sugar      53.55621 53.29963             73.45684
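The same relationship can be checked on any classification forest fitted with importance = TRUE. The sketch below assumes a two-class problem built from iris (the wine data is not reproduced here) and compares importance() with scale = TRUE against the manual division shown above.

```r
library(randomForest)

set.seed(1)
# Assumption: a two-class stand-in for the wine data, derived from iris.
d <- iris
d$is_virginica <- factor(d$Species == "virginica")
d$Species <- NULL

rf <- randomForest(is_virginica ~ ., data = d, ntree = 300, importance = TRUE)

unscaled <- importance(rf, scale = FALSE)  # same values as rf$importance
scaled   <- importance(rf, scale = TRUE)   # accuracy columns divided by importanceSD

# Manual scaling of the two class columns and MeanDecreaseAccuracy.
manual <- rf$importance[, 1:3] / rf$importanceSD[, 1:3]
print(all.equal(scaled[, 1:3], manual))
```

MeanDecreaseGini has no entry in importanceSD, which is why only the first three columns are divided.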

The decrease in accuracy is measured by permuting the values of the predictor in the out-of-bag samples and calculating the corresponding decrease. You do this for each tree over its own OOB samples to get the mean and SD. It is also discussed in this post.

This importance score indicates how useful the variables are for prediction. You can visualize them like this, where you see, for example, that alcohol differs clearly between the two classes, as opposed to fixed_acidity:
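The per-tree procedure can be sketched by hand. This is a rough illustration under stated assumptions, not the package's internal code: it uses iris, a small forest, keep.inbag = TRUE to recover each tree's OOB rows, and predict(..., predict.all = TRUE) to get per-tree predictions. The exact normalisation used by randomForest may differ, so the numbers need not match importance() exactly.

```r
library(randomForest)

set.seed(1)
# Assumptions: iris data, 50 trees, and "Petal.Length" as the variable to permute.
d  <- iris
rf <- randomForest(Species ~ ., data = d, ntree = 50, keep.inbag = TRUE)

var_name <- "Petal.Length"
truth    <- as.character(d$Species)
# Matrix of per-tree predictions (rows = samples, columns = trees).
base_pred <- predict(rf, d, predict.all = TRUE)$individual

drops <- numeric(rf$ntree)
for (t in seq_len(rf$ntree)) {
  # Rows that tree t never saw during training.
  oob <- which(rf$inbag[, t] == 0)
  # Permute the predictor among those OOB rows only.
  perm <- d
  perm[oob, var_name] <- perm[sample(oob), var_name]
  perm_pred <- predict(rf, perm, predict.all = TRUE)$individual
  # Accuracy of tree t on its OOB rows, before and after permutation.
  acc_before <- mean(base_pred[oob, t] == truth[oob])
  acc_after  <- mean(perm_pred[oob, t] == truth[oob])
  drops[t]   <- acc_before - acc_after
}

# Mean decrease in accuracy over trees, and its spread across trees.
print(mean(drops))
print(sd(drops))
```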

par(mfrow=c(1,2))
boxplot(fixed_acidity~quality01,data=wine)
boxplot(alcohol~quality01,data=wine)

Gini is another way of looking at the predictive power of your variables (check also the explanation on Gini). The difference you see is due to the fact that Gini is measured across all trees, whereas MDA is calculated separately for each class.

Sometimes these importance measures are used when we want to know more about the variables associated with the response, after modeling the data. If interested, you can check out section 11 of this initial paper by Breiman.
