Solved – Interpreting Partial Dependence Plots in GBM

Tags: boosting, partial-plot, r

I am using gbm to fit a model and partial dependence plots to interpret parts of the model.

There seem to be some differences between the partial dependence plots produced by gbm and those produced by the dismo extension, but that might be saved for a different post.

Anyways, when using gbm I am having a hard time interpreting what the y-axis is.
See image below.

To give a little bit of background: I am building a model with Biomass as my response and about 30 predictors.

Looking at this image – I thought that the y-axis was the response (biomass) when only FvFm is being used in the model. So this would mean that lower FvFm values are usually indicative of higher biomass. However, this is incorrect – as we would expect there to be lower biomass with lower FvFm (especially at values < 0.3) and higher biomass at medium to higher values of FvFm.

My question would be: How can I interpret the y-axis? Is it actually my response variable or does it have different meaning?

[Image: partial dependence plot of FvFm from the gbm model]

Best Answer

plot.gbm doesn't ignore the effects of the other predictors. Rather, according to ?plot.gbm, it

Plots the marginal effect of the selected variables by "integrating" out the other variables.

You may wish to review section 8.2 of Friedman (2001), which is cited by ?plot.gbm when it states:

plot.gbm produces low dimensional projections of the gbm.object by integrating out the variables not included in the i.var argument. The function selects a grid of points and uses the weighted tree traversal method described in Friedman (2001) to do the integration.
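For reference, the "integrating out" in this description is commonly written as an average over the training data. The notation below is my own shorthand, not taken from ?plot.gbm: $x_S$ is the plotted variable (or variables), $x_{i,C}$ are the observed values of the remaining predictors for observation $i$, $N$ is the number of training observations, and $\hat{F}$ is the fitted model.

$$\bar{F}_S(x_S) = \frac{1}{N} \sum_{i=1}^{N} \hat{F}(x_S, x_{i,C})$$

As I understand it, the weighted tree traversal is a way of evaluating this kind of average from the trees themselves, without looping over all $N$ observations.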

Looking at the bottom of p. 27 of the paper ("For regression trees based on single-variable splits..."), I think a partial dependence plot of $y$ on $x$ can be summarized as the following tree traversal:

  • Create a sequence of values for $x$: $x_1$ through $x_r$ (see the parameter continuous.resolution).

  • For each value of $x_i$ in this sequence:

    • Start at the root node of the tree
    • Set weight $w$ = 1
    • Traverse tree as follows:

      • If the node is a split on $x$, take the branch corresponding to the value $x_i$ and keep $w$ unchanged.
      • If the node is a split on a variable other than $x$, visit both child nodes:

        • Left child gets weight: $w_{left}$ = $w_{parent} * p_{left}$
        • Right child gets weight: $w_{right}$ = $w_{parent} * p_{right}$
        • where $p_{left}$ is the proportion of the training observations at the parent node that continue to the left child, similarly for right
      • If the node is a terminal node, return $w_{term} * f(x_i)$, where $w_{term}$ is the weight accumulated on the path to that node and $f(x_i)$ is the predicted value* at that terminal node.

    • Return $\bar{F}(x_i)$ which is simply the sum of returned weighted values from the terminal nodes.
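  • Repeat for each of the $M$ trees, then combine the per-tree results into $\bar{F}(x_1)$ through $\bar{F}(x_r)$. Because a boosted model is additive, I believe the trees' contributions are summed (on top of the model's initial value) rather than averaged.

  • Plot the resulting function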

*The predicted value is what appears on the y-axis.
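The traversal in the list above can be sketched in a few lines of Python. This is a minimal illustration on two hand-built toy trees, not gbm's actual implementation: the `Node` class, the function names, the "go left when the value is below the threshold" rule, and the choice to sum the trees on top of an initial value `init` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Terminal node: only `value` is set. Internal node: the split fields are set.
    value: Optional[float] = None
    var: Optional[str] = None          # name of the splitting variable
    threshold: Optional[float] = None  # assumed rule: go left when x < threshold
    p_left: Optional[float] = None     # share of training rows at this node going left
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tree_partial(node: Node, var: str, x_i: float, w: float = 1.0) -> float:
    """Weighted traversal of one tree for the plotted variable `var` at value `x_i`."""
    if node.value is not None:
        # Terminal node: return its prediction times the accumulated weight.
        return w * node.value
    if node.var == var:
        # Split on the plotted variable: follow one branch, weight unchanged.
        child = node.left if x_i < node.threshold else node.right
        return tree_partial(child, var, x_i, w)
    # Split on another variable: visit both children, weighting by the
    # proportion of training observations that went each way.
    return (tree_partial(node.left, var, x_i, w * node.p_left)
            + tree_partial(node.right, var, x_i, w * (1.0 - node.p_left)))


def partial_dependence(trees: List[Node], var: str, grid: List[float],
                       init: float = 0.0) -> List[float]:
    """Combine the per-tree values at each grid point (assumption: additive ensemble)."""
    return [init + sum(tree_partial(t, var, x) for t in trees) for x in grid]


# Toy tree 1: splits on x first, then on another variable z.
tree1 = Node(var="x", threshold=0.5,
             left=Node(value=1.0),
             right=Node(var="z", threshold=2.0, p_left=0.25,
                        left=Node(value=2.0), right=Node(value=4.0)))

# Toy tree 2: splits on z first, then on x.
tree2 = Node(var="z", threshold=1.0, p_left=0.5,
             left=Node(value=-1.0),
             right=Node(var="x", threshold=0.5,
                        left=Node(value=0.0), right=Node(value=2.0)))

grid = [0.2, 0.8]
pd_values = partial_dependence([tree1, tree2], "x", grid)
print(pd_values)
```

Note that a split on `z` never removes a branch from consideration. It only down-weights it, which is why the other predictors are averaged over rather than ignored.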

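Since you have a continuous outcome variable (biomass), the predicted value is on the scale of your response variable. It is the prediction averaged over the other predictors, not the prediction from a model that contains FvFm alone.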

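However, if you had a Bernoulli or Poisson outcome variable, the predicted value would be on the link scale (log-odds or log of the expected count, respectively), and you would need to use type="response" to plot on the scale of the response variable. (See ?predict.gbm)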
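To make the link-versus-response distinction concrete, here is a small sketch of the back-transformations involved, assuming the usual logit link for a Bernoulli outcome, log link for a Poisson outcome, and identity link for a Gaussian outcome. The function `to_response` is a hypothetical helper written for this illustration and is not part of gbm.

```python
import math


def to_response(link_value: float, distribution: str) -> float:
    """Map a value on the link scale back to the response scale."""
    if distribution == "bernoulli":
        # Inverse logit: log-odds -> probability.
        return 1.0 / (1.0 + math.exp(-link_value))
    if distribution == "poisson":
        # Inverse log: log of the expected count -> expected count.
        return math.exp(link_value)
    if distribution == "gaussian":
        # Identity link: the value is already on the response scale.
        return link_value
    raise ValueError(f"unsupported distribution: {distribution}")


print(to_response(0.0, "bernoulli"))  # log-odds of 0
print(to_response(0.0, "poisson"))    # log-count of 0
print(to_response(2.5, "gaussian"))   # unchanged
```

For a continuous outcome such as biomass the transformation is the identity, so the y-axis can be read directly in the units of the response.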