Solved – Partial Effects Plots vs. Partial Dependence Plots for Random Forests

interpretation, machine learning, partial plot, random forest, regression

One method for interpreting the relationship between a predictor X and the response variable Y in a fitted multiple regression model is a Partial Effects (PE) plot. It is generated by holding all other predictors constant at their mean or median (for continuous variables) or mode (for categorical variables) and plotting the fitted model's predictions of Y over a range of values of the predictor of interest X. (Frank Harrell's Biostatistical Modeling course notes have some information on how to do this: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/BioMod/notes.pdf)
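To make this concrete, here is a minimal sketch of a PE plot in Python. The toy data, the predictor names `x` and `z`, and the choice of scikit-learn's RandomForestRegressor are all illustrative assumptions, not part of the original post:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data and model, purely for illustration
rng = np.random.default_rng(0)
X_train = pd.DataFrame({"x": rng.normal(size=200),
                        "z": rng.normal(size=200)})
y_train = 2 * X_train["x"] + X_train["z"] + rng.normal(scale=0.1, size=200)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# PE plot: hold the other predictor at its mean (or mode, if categorical)
# and predict over a grid of values of the predictor of interest
grid = np.linspace(X_train["x"].min(), X_train["x"].max(), 50)
pe_data = pd.DataFrame({"x": grid, "z": X_train["z"].mean()})

plt.plot(grid, model.predict(pe_data))
plt.xlabel("x"); plt.ylabel("predicted y (PE plot)")
plt.show()
```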

From reading the literature on the interpretation of Machine Learning models such as Random Forests, it appears that one of the most widely used methods for interpreting relationships between predictors and responses in fitted ML models is the Partial Dependence Plot (PDP), introduced by Jerome Friedman in 2001. The idea is similar to a PE plot but subtly different: a PDP plots the change in the average predicted Y as X varies over its marginal distribution. Skipping over the mathematics, the procedure is to set X to some (perturbed) value for every observation in the training set, keep the other predictors as they are, generate a prediction for each observation, and average the results; then repeat for the next value of X, and so on over the range of values of interest. The averaged predictions are then plotted against X. (See Friedman's original paper at https://statweb.stanford.edu/~jhf/ftp/trebst.pdf and the summary in Elements of Statistical Learning, pp. 369-70.)
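Here is a brute-force sketch of that procedure, under the same illustrative assumptions as the PE example above (scikit-learn also ships a built-in implementation in sklearn.inspection.partial_dependence):

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """Friedman's partial dependence by brute force: for each grid value,
    set `feature` to that value for EVERY training row, predict, and
    average the predictions."""
    averaged = []
    for value in grid:
        X_perturbed = X.copy()
        X_perturbed[feature] = value   # perturb one column, keep the rest as-is
        averaged.append(model.predict(X_perturbed).mean())
    return np.array(averaged)

# e.g., with the objects from the previous sketch:
# pdp = partial_dependence(model, X_train, "x", grid)
# plt.plot(grid, pdp)
```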

My question is the following: there appears to be nothing in principle preventing us from generating a PE plot from a fitted ML model: we would simply construct a dataset with X varying and the other predictors held constant at their means/modes, and then plot the model's predictions. What is the specific motivation for using PDPs with ML models? Is there something illegitimate about generating a PE plot from an ML model such as a Random Forest? I suspect it has something to do with the non-linearities and interactions captured by a Random Forest's predictions, but it would be nice to pin down this intuition more precisely.

Best Answer

For linear models without categorical variables, if you hold the other predictors at their means when computing the PE plot, the PE plot is the same as the PDP. Intuitively, the PE plot averages the other variables first and then plots a prediction curve whose slope is the coefficient beta; the PDP computes a prediction for every instance and then averages. Because a linear model's prediction at the average equals the average of its predictions, the two curves coincide, and the slope of the PDP is also beta (see https://cran.r-project.org/web/packages/datarobot/vignettes/PartialDependence.html).
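A quick numerical check of this claim, on hypothetical simulated data with a scikit-learn linear model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=500)
model = LinearRegression().fit(X, y)

grid = np.linspace(-2, 2, 5)
z_mean = X[:, 1].mean()

# PE curve: hold the other predictor at its mean
pe = np.array([model.predict([[v, z_mean]])[0] for v in grid])

# PDP: average prediction over all rows with column 0 set to v
pdp = []
for v in grid:
    Xp = X.copy()
    Xp[:, 0] = v
    pdp.append(model.predict(Xp).mean())
pdp = np.array(pdp)

print(np.allclose(pe, pdp))                         # True: the curves coincide
print(np.diff(pe) / np.diff(grid), model.coef_[0])  # both slopes equal beta
```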

However, for linear models with categorical variables, holding a categorical predictor at its mode does not reproduce the PDP: the mode is not the mean of the dummy-coded variable, so the PE curve is shifted away from the PDP by a constant (and if X interacts with the categorical variable, the slopes differ as well), whereas the PDP still recovers the average effect. I think this is where the difference lies, and in this sense the PDP seems clearly better.
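A sketch of the additive case with one dummy-coded predictor, again on hypothetical simulated data: the PE and PDP curves differ by a constant vertical offset of roughly beta_z * (mean(z) - mode(z)):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = rng.normal(size=500)
z = rng.binomial(1, 0.3, size=500)   # dummy-coded categorical; mode = 0, mean ~ 0.3
y = 3.0 * x + 2.0 * z + rng.normal(size=500)
model = LinearRegression().fit(np.column_stack([x, z]), y)

grid = np.linspace(-2, 2, 5)

# PE curve: hold z at its mode (0)
pe = np.array([model.predict([[v, 0]])[0] for v in grid])

# PDP: average prediction over all rows with x set to v, z kept as observed
pdp = []
for v in grid:
    Xp = np.column_stack([np.full(500, v), z])
    pdp.append(model.predict(Xp).mean())
pdp = np.array(pdp)

print(pdp - pe)   # roughly constant offset ~ beta_z * mean(z) = 2.0 * 0.3
```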