Partial Dependence Plot – Interpretation of Y-Axis

Tags: boosting, interpretation, machine-learning, scikit-learn

First off, I know there are many questions on this site similar to this one. I've read them, and have not been able to find a solution.

In Elements of Statistical Learning, the following figure shows partial dependence plots for California Housing Data:

[Figure: partial dependence plots for the California Housing data, reproduced from The Elements of Statistical Learning]

The text defines partial dependence of $f(X)$ on $X_S$ as $f_S(X_S) = E_{X_C}f(X_S, X_C)$, the marginal average of $f$.
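
(In practice the book estimates this expectation by averaging over the training data, $\bar{f}_S(X_S) = \frac{1}{N} \sum_{i=1}^{N} f(X_S, x_{iC})$, where the $x_{iC}$ are the values of $X_C$ occurring in the training set; so the estimate is in the same units as the model's predictions.)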

I'm wondering how to interpret the y-axis of these plots. Based on the definition, I would expect the y-axis to show the housing price as the given x-axis variables vary, averaged over all other variables. But that can't be the case, because the y-axis takes negative values, and all values lie roughly in the range $-1$ to $2$.

The scikit-learn documentation shows how to make the plots here: https://scikit-learn.org/stable/auto_examples/inspection/plot_partial_dependence.html#sphx-glr-auto-examples-inspection-plot-partial-dependence-py.
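
For reference, something along these lines reproduces the basic plot. This is a minimal sketch, not the exact code of the linked example: it assumes a reasonably recent scikit-learn with PartialDependenceDisplay, and the gradient boosting model and the two features are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

# California Housing data: the target is median house value in units of $100,000
X, y = fetch_california_housing(return_X_y=True, as_frame=True)

# Fit a gradient boosting regressor (hyperparameters are illustrative)
model = HistGradientBoostingRegressor(random_state=0).fit(X, y)

# Without any centring of y, the y-axis of these plots is in the target's own units
PartialDependenceDisplay.from_estimator(model, X, features=["MedInc", "AveOccup"])
plt.show()
```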

Other questions have asked specifically about the R implementation for classification, which plots on the logit scale and so explains the negative values there. But I'm asking about the regression case, as described in Elements.

Best Answer

What we are seeing are changes relative to an overall central tendency, which is easy to miss. Which central tendency is used is largely the author's choice.

In The Elements of Statistical Learning (2009), the caption of Fig. 10.17 in particular notes: "Partial dependence of median house value on location in California. One unit is \$100,000, at 1990 prices, and the values plotted are relative to the overall median of \$180,000." (emphasis mine).

Similarly, the "California Housing data preprocessing" section of the scikit-learn example contains the line y -= y.mean(). This subtracts the mean target value from the target vector, centring the target at $0$; as a result, the PDP values will also be centred approximately around $0$. (sklearn's fetch_california_housing() also states that the target is "in units of 100,000".)
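
To make that concrete, here is a tiny sketch of what that preprocessing line does (the quoted values are approximate):

```python
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True)
print(y.mean())   # roughly 2.07, i.e. about $207,000 in units of $100,000

y -= y.mean()     # the preprocessing step from the sklearn example
print(y.mean())   # ~0: the model now predicts deviations from the mean,
                  # so the PDP values end up centred near 0 as well
```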

In general, I prefer to show my PDPs uncentred in the first instance, because it makes plain how much (approximate) variability a particular feature induces in the response; but that is a matter of choice. As we see, Hastie et al. (2009) centred around the median, sklearn centred around the mean, while Friedman (2001) "Greedy function approximation: A gradient boosting machine" (Fig. 11) did not centre at all. Similar to Friedman (2001), Greenwell (2017) "pdp: An R Package for Constructing Partial Dependence Plots" also does not centre by default (Fig. 2 and others).
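
To illustrate that these conventions only differ by a constant vertical shift, here is a rough sketch. The feature and model are placeholders, shifting the PDP values directly is only an approximation to re-centring the target before fitting, and in scikit-learn versions before 1.3 the grid is returned under the key "values" rather than "grid_values".

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = HistGradientBoostingRegressor(random_state=0).fit(X, y)

pd_res = partial_dependence(model, X, features=["MedInc"])
grid = pd_res["grid_values"][0]   # "values" in older scikit-learn versions
curve = pd_res["average"][0]

# The same curve under the three conventions, shifted vertically:
plt.plot(grid, curve, label="uncentred (Friedman 2001; pdp default)")
plt.plot(grid, curve - y.mean(), label="centred on the mean (sklearn example)")
plt.plot(grid, curve - np.median(y), label="centred on the median (Hastie et al. 2009)")
plt.xlabel("MedInc")
plt.ylabel("partial dependence (units of $100,000)")
plt.legend()
plt.show()
```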