Solved – how to calculate partial dependence when I have 4 predictors

partial-effect, random-forest

I was reading Friedman's book "The Elements of Statistical Learning" (2nd edition). Page 365 talks about partial dependence plots. I don't quite understand how he actually calculates the partial dependence of f(X) on Xs. Say I built a model on 4 predictors and I use the model itself as f(Xs, Xc). I want to know the partial dependence of my first predictor V1. If the data looks like:

V1  V2  V3  V4  predicted_p
 1   1   0   1  0.2
 1   2   0   0  0.24
 2   1   1   1  0.6
 2   2   0   1  0.4

Does the partial dependence for V1 have two values, one for V1=1 and the other for V1=2?
For the case V1=1, PD = (0.2+0.24)/2 = 0.22; for the case V1=2, PD = (0.6+0.4)/2 = 0.5.

Do I understand right? Thank you.

Also, are the means of the partial dependence equal across predictors?

Best Answer

Suppose that we have a data set $X = [x_s \, x_c] \in \mathbb R^{n \times p}$, where $x_s$ is a matrix of the variables we want partial dependencies for and $x_c$ is a matrix of the remaining predictors. Let $y \in \mathbb R^n$ be a vector of responses (i.e. a regression problem). Suppose that $y = f(X) + \epsilon$ and that we estimate some fit $\hat f$.

Then $\hat f_s (x)$, the partial dependence of $\hat f$ at $x$ (here $x$ lives in the same space as $x_s$), is defined as:

$$\hat f_s(x) = {1 \over n} \sum_{i=1}^n \hat f(x, x_{c_i})$$

This says: hold $x$ constant for the variables of interest and take the average prediction over all the combinations of the other variables observed in the training set. So we need to pick the variables of interest, and also a region of the space that $x_s$ lives in that we are interested in. Note: be careful about extrapolating the marginal mean of $\hat f$ outside of this region.
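
To make this concrete with the four-row example from the question (treating the fitted model as $\hat f$), the partial dependence at $V_1 = 1$ averages $\hat f$ over all four observed combinations of $(V_2, V_3, V_4)$, not just over the two rows where $V_1 = 1$:

$$\hat f_{V_1}(1) = \tfrac{1}{4}\left[\hat f(1, 1, 0, 1) + \hat f(1, 2, 0, 0) + \hat f(1, 1, 1, 1) + \hat f(1, 2, 0, 1)\right]$$

and similarly for $V_1 = 2$.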

Here's an example implementation in R. We start by creating an example dataset:

library(tidyverse)
library(ranger)
library(broom)

# keep the response (hp) and a few numeric predictors
mt2 <- mtcars %>%
  as_tibble() %>%
  select(hp, mpg, disp, wt, qsec)

Then we estimate $f$ using a random forest:

fit <- ranger(hp ~ ., mt2)  # random forest regression of hp on the other columns

Next we pick the feature we're interested in estimating partial dependencies for:

var <- quo(disp)  # a tidy-eval quosure so we can refer to the column programmatically

Now we can split the dataset into this predictor and other predictors:

x_s <- select(mt2, !!var)   # grid where we want partial dependencies
x_c <- select(mt2, -!!var)  # other predictors

Then we create a dataframe of all combinations of these datasets:

# if the training dataset is large, use a subsample of x_c instead
grid <- crossing(x_s, x_c)
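
If the training set is large, the full cross can get very big. As a minimal sketch of the subsampling idea (assuming dplyr 1.0+ for slice_sample(); the cap of 500 rows is an arbitrary choice), you could cross $x_s$ with a random subsample of the other predictors instead:

set.seed(42)
x_c_sub <- slice_sample(x_c, n = min(nrow(x_c), 500))  # random subset of other-predictor rows
grid <- crossing(x_s, x_c_sub)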

We want to know the predictions of $\hat f$ at each point on this grid. I define a helper in the spirit of broom::augment() for this:

# a small augment() method for ranger fits: attach predictions to new data
augment.ranger <- function(x, newdata) {
  newdata <- as_tibble(newdata)
  mutate(newdata, .fitted = predict(x, newdata)$predictions)
}

au <- augment(fit, grid)

Now we have the predictions and we marginalize by taking the average for each point in $x_s$:

pd <- au %>%
  group_by(!!var) %>%
  summarize(yhat = mean(.fitted))

We can visualize this as well:

pd %>%
  ggplot(aes(!!var, yhat)) +
  geom_line(size = 1) +
  labs(title = "Partial dependence plot for displacement",
       y = "Average prediction across all other predictors",
       x = "Engine displacement") +
  theme_bw()

Finally, we can check this implementation against the pdp package to make sure it's correct:

pd2 <- pdp::partial(
  fit,
  pred.var = quo_name(var),
  pred.grid = distinct(mtcars, !!var),
  train = mt2
)

testthat::expect_equivalent(pd, pd2)  # silent, so we're good

For a classification problem, you can repeat a similar procedure, except that you predict the class probability for a single class instead.
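
As a rough sketch of that (not part of the original answer): assume ranger's probability forest mode (probability = TRUE), the built-in iris data, and that we pick one class, say "virginica", and one feature, Petal.Width:

# classification example: partial dependence of P(Species == "virginica") on Petal.Width
fit_cls <- ranger(Species ~ ., iris, probability = TRUE)

x_s_cls <- select(iris, Petal.Width)             # feature of interest
x_c_cls <- select(iris, -Petal.Width, -Species)  # remaining predictors

grid_cls <- crossing(x_s_cls, x_c_cls)

# predictions are a matrix of class probabilities; keep the column for one class
pd_cls <- grid_cls %>%
  mutate(p_virginica = predict(fit_cls, grid_cls)$predictions[, "virginica"]) %>%
  group_by(Petal.Width) %>%
  summarize(p_virginica = mean(p_virginica))

Plotting pd_cls against Petal.Width then gives the partial dependence curve for that single class.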