Solved – GBM: How to interpret relative variable influence

boosting, machine learning, r

I recently used the gbm package in RStudio for my analysis. All worked well, but I am struggling to understand the summary of the model.

How do I interpret the relative influence of the variables? I can't find a definitive answer to this question anywhere.

In this article from Towards Data Science I found this rather vague description:

An important feature in the gbm modelling is the Variable Importance. Applying the summary function to a gbm output produces both a Variable Importance Table and a Plot of the model. This table below ranks the individual variables based on their relative influence, which is a measure indicating the relative importance of each variable in training the model.

I got even more confused when I read this paper about gradient boosting machines. The authors state in Section 5.1:

Influences do not provide any explanations about how the variable actually affects the response. The resulting influences can then be used for both forward and backwards feature selection procedures.

Let's get specific with a small example:

Assume a model with 4 explanatory variables. The gbm model reports the relative influences as follows:

  • variable1: 0.5
  • variable2: 0.2
  • variable3: 0.2
  • variable4: 0.1

First of all, the model reports no variables with zero influence. Does that mean all variables are necessary?

Can one say that variable1 explains 50% of the variance?

Which statements can be made on the basis of this information?

Best Answer

These values do not refer to the variance. There are two main approaches to variable importance measures for GBMs.

In the first (per Breiman (2001), for example), the importance of a predictor is represented by the average increase in prediction error when that predictor is shuffled (permuted). This is similar to what random forests do and is commonly referred to as "permutation importance". It is common to normalise the values in some way, either by having them add up to 1 (or 100) or by scaling so that the most important variable has importance 1 (or 100). In the case you mention, the normalisation appears to make them add up to 1.
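
A minimal sketch of permutation importance for a gbm fit, using made-up data and variable names that merely mirror the question (ideally the permutation loss would be computed on held-out data rather than the training set):

```r
library(gbm)

set.seed(1)
n <- 500
dat <- data.frame(
  variable1 = rnorm(n),
  variable2 = rnorm(n),
  variable3 = rnorm(n),
  variable4 = rnorm(n)
)
dat$y <- 2 * dat$variable1 + dat$variable2 + rnorm(n)

fit <- gbm(y ~ ., data = dat, distribution = "gaussian",
           n.trees = 200, interaction.depth = 2)

# Baseline loss (MSE) with the data left intact
base_mse <- mean((dat$y - predict(fit, dat, n.trees = 200))^2)

# Increase in loss when each predictor is shuffled in turn
perm_imp <- sapply(names(dat)[1:4], function(v) {
  shuffled <- dat
  shuffled[[v]] <- sample(shuffled[[v]])   # break the link between v and the response
  mean((dat$y - predict(fit, shuffled, n.trees = 200))^2) - base_mse
})

round(perm_imp / sum(perm_imp), 3)  # normalised so the importances add up to 1
```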

The second way (used in sklearn, for example) is to traverse the trees and record how much a given metric (e.g. MSE) improves every time a given variable is used for splitting. We average these reductions across all base learners for each variable, normalise them, and we are done. This is usually what is referred to as "relative influence", and it is considerably faster than permutation importance. It is the approach described in the linked 2013 paper by Natekin & Knoll.
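
For reference, the gbm package itself exposes both measures through its summary() method; a short sketch, reusing the illustrative `fit` from above:

```r
# Tree-traversal loss reductions (gbm's default "relative influence")
rel_inf  <- summary(fit, n.trees = 200, plotit = FALSE,
                    method = relative.influence)

# Permutation-based alternative offered by the same package
perm_inf <- summary(fit, n.trees = 200, plotit = FALSE,
                    method = permutation.test.gbm)

rel_inf   # columns: var, rel.inf (normalised by default)
```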

Some further comments relating directly to the questions mentioned:

  1. All the features have "some importance": since none of the importances is zero, none of the variables can be called unnecessary on the basis of this output alone.
  2. In the example shown, variable1 does not "explain 50% of the variance". It is the most important feature and accounts for 50% of the reduction in the loss function given this set of features. If the model's overall fit is minuscule, even a feature with high relative importance does not mean much: half of almost nothing is still almost nothing.
  3. The statements made have to be in relation to the particular application. In that respect, if we want to focus on a particular feature, it may be worth doing a manual permutation importance ourselves and then reporting the overall loss reduction, as sketched after this list. (e.g. "While the original model $X$ using feature $Y$ has loss $Z$, the model trained with the permuted feature $Y^*$ has loss $10Z$", etc.)
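
A minimal sketch of such a single-feature report, in the evaluate-without-retraining variant and reusing the hypothetical `fit` and `dat` from the earlier example (the names and numbers are purely illustrative):

```r
# Loss of the original model on the intact data
orig_loss <- mean((dat$y - predict(fit, dat, n.trees = 200))^2)

# Loss after permuting only the feature of interest
perm_dat <- dat
perm_dat$variable1 <- sample(perm_dat$variable1)
perm_loss <- mean((dat$y - predict(fit, perm_dat, n.trees = 200))^2)

c(original = orig_loss, permuted = perm_loss, ratio = perm_loss / orig_loss)
```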