Solved – Random Forest partial plot

Tags: partial plot, r, random forest

I have the following graph generated after I used the partial plot function of Random Forest in R.

[Figure: partial dependence plot for temperature]

My reading of this image is that temperature has a linear relationship with the variable on the y axis (the one I am predicting) for values between 0 and roughly 25. For values between 25 and 100 the relationship is also linear, but the effect of temperature is weaker because the slope is shallower. For values above 100, temperature seems to have no effect on the variable I am predicting.

Is this the correct explanation, or is there more to it than this?

Best Answer

Something like that would be my starting assumption, and for many practical examples you would be unlucky if it turned out to be very wrong. But...

Noise: The more noise there is, the more conservative the predictions of the RF become (regression towards the mean). This introduces a bias that generally reduces the amplitude/steepness of a given partial plot. This should be regarded as a feature, not a bug. Thus the flatness at the upper end can also be due to few samples and more noise.

Interactions: A partial plot of the higher-dimensional structure of the trained RF model is only suitable when there are no dominant interactions involving this specific variable. In the extreme case a variable can be highly important but have a nearly flat partial function, or you could end up with a Simpson's paradox: http://en.wikipedia.org/wiki/Simpson%27s_paradox.
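To make the extreme case concrete, here is a minimal sketch in plain Python. It is not the randomForest implementation: `predict` is a hypothetical stand-in for a trained model (a pure interaction, `x1 * x2`), and the data are made up so that `x2` is symmetric around zero. The partial dependence is computed the usual way: force `x1` to a grid value in every row, predict, and average.

```python
# Hypothetical stand-in for a trained model: a pure interaction between x1 and x2.
def predict(x1, x2):
    return x1 * x2

# Made-up data; x2 is symmetric around zero.
rows = [(x1, x2) for x1 in (0.0, 1.0, 2.0, 3.0) for x2 in (-1.0, 1.0)]

def partial_dependence(grid, rows):
    """Average prediction with x1 forced to each grid value, x2 left as observed."""
    curve = []
    for g in grid:
        preds = [predict(g, x2) for _, x2 in rows]
        curve.append(sum(preds) / len(preds))
    return curve

grid = [0.0, 1.0, 2.0, 3.0]
pd_all = partial_dependence(grid, rows)
pd_pos = partial_dependence(grid, [r for r in rows if r[1] > 0])
pd_neg = partial_dependence(grid, [r for r in rows if r[1] < 0])

print(pd_all)  # flat at zero, although x1 drives every prediction
print(pd_pos)  # rising within the x2 > 0 subgroup
print(pd_neg)  # falling within the x2 < 0 subgroup
```

The averaged curve is flat because the two subgroups cancel, which is why a flat partial plot alone does not show that a variable is unimportant.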

Sample density: Alternatively, you could say more crudely that overall y = a log(x) + b. I would recommend plotting an overlay of the training samples. Otherwise it is hard to assess whether a given local bump is most likely due to few samples and some noise, or is actually a sound trend that deserves to be described in detail.
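When a plotted overlay is not at hand, a rough numeric stand-in is to count the training samples in each interval of the predictor. The sketch below is plain Python with made-up temperature values and hypothetical bin edges chosen to match the three regions described in the question; none of the numbers come from the actual data.

```python
from bisect import bisect_right

# Made-up temperature values standing in for the training data.
temperature = [2, 5, 8, 11, 14, 18, 21, 24, 30, 41, 55, 70, 88, 120, 160]

# Hypothetical bin edges along the x axis of the partial plot.
edges = [0, 25, 100, 200]

def samples_per_bin(values, edges):
    """Count values in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_right(edges, v) - 1
        if 0 <= i < len(counts):  # ignore values outside the outer edges
            counts[i] += 1
    return counts

counts = samples_per_bin(temperature, edges)
for lo, hi, n in zip(edges, edges[1:], counts):
    print(f"[{lo}, {hi}): {n} samples")
```

A region with only a handful of samples, such as the top bin in this toy data, is where a flat or bumpy stretch of the curve deserves the most skepticism.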

Did the model use the specific variable much?: If the variable importance of this variable is very low, that often means the variable has not been used much in the trees of the forest. The partial function then becomes less reproducible and cruder. This can happen when the data are noisy or sparse. It helps a little to lower mtry, so that weaker variables are used more often.

Lastly, a link to a similar question I answered with some code examples for R randomForest: R: What do I see in partial dependence plots of gbm and RandomForest?