Solved – GBM: How to interpret relative variable influence

boosting, machine learning, r

I recently used the gbm package in RStudio for my analysis. All worked well, but I am struggling to understand the summary of the model.

How do I interpret the relative influence of the variables? I can't find a definitive answer to this question anywhere.

In this article from Towards Data Science I found this rather vague description:

An important feature in the gbm modelling is the Variable Importance. Applying the summary function to a gbm output produces both a Variable Importance Table and a Plot of the model. This table below ranks the individual variables based on their relative influence, which is a measure indicating the relative importance of each variable in training the model.

I got even more confused when I read this paper about gradient boosting machines. The authors state in Section 5.1:

Influences do not provide any explanations about how the variable actually affects the response. The resulting influences can then be used for both forward and backwards feature selection procedures.

Let's get specific with a small example:

Assume a model with 4 explanatory variables. The gbm model reports the relative influences as follows:

  • variable1: 0.5
  • variable2: 0.2
  • variable3: 0.2
  • variable4: 0.1

First of all, the model reports no variables with zero influence. Does that mean all variables are necessary?

Can one say that variable1 explains 50% of the variance?

Which statements can be made on the basis of this information?

Best Answer

These values do not refer to the variance. There are two main approaches to variable importance measures for GBMs.

In the first (per Breiman (2001), for example), the importance of a predictor is represented by the average increase in prediction error when that predictor is shuffled (permuted). This is similar to what random forests do and is commonly referred to as "permutation importance". It is common to normalise the values in some way, either by having them add up to 1 (or 100) or by scaling so that the most important variable has importance 1 (or 100). In the case you mention, the normalisation appears to make them add up to 1.
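
A minimal sketch of permutation importance for a gbm fit, using made-up data and variable names that merely mirror the question (ideally the permutation loss would be computed on held-out data rather than the training set):

```r
library(gbm)

set.seed(1)
n <- 500
dat <- data.frame(
  variable1 = rnorm(n),
  variable2 = rnorm(n),
  variable3 = rnorm(n),
  variable4 = rnorm(n)
)
dat$y <- 2 * dat$variable1 + dat$variable2 + rnorm(n)

fit <- gbm(y ~ ., data = dat, distribution = "gaussian",
           n.trees = 200, interaction.depth = 2)

# Baseline loss (MSE) with the data left intact
base_mse <- mean((dat$y - predict(fit, dat, n.trees = 200))^2)

# Increase in loss when each predictor is shuffled in turn
perm_imp <- sapply(names(dat)[1:4], function(v) {
  shuffled <- dat
  shuffled[[v]] <- sample(shuffled[[v]])   # break the link between v and the response
  mean((dat$y - predict(fit, shuffled, n.trees = 200))^2) - base_mse
})

round(perm_imp / sum(perm_imp), 3)  # normalised so the importances add up to 1
```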

The second way (used in sklearn, for example) is to traverse the trees and record how much a given metric (e.g. MSE) improves every time a given variable is used for splitting. We average these reductions across all base learners for each variable, normalise them, and we are done. This is usually what is referred to as "relative influence", and it is considerably faster than permutation importance. It is the approach described in the linked 2013 paper by Natekin & Knoll.
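
For reference, the gbm package itself exposes both measures through its summary() method; a short sketch, reusing the illustrative `fit` from above:

```r
# Tree-traversal loss reductions (gbm's default "relative influence")
rel_inf  <- summary(fit, n.trees = 200, plotit = FALSE,
                    method = relative.influence)

# Permutation-based alternative offered by the same package
perm_inf <- summary(fit, n.trees = 200, plotit = FALSE,
                    method = permutation.test.gbm)

rel_inf   # columns: var, rel.inf (normalised by default)
```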

Some further comments relating directly to the questions mentioned:

  1. All the features have "some importance": since none of the importances is zero, none of the variables can be called unnecessary on the basis of this output alone.
  2. In the example shown, variable1 does not "explain 50% of the variance". It is the most important feature and accounts for 50% of the reduction in the loss function given this set of features. If the model's overall fit is minuscule, even a feature with high relative importance does not mean much: half of almost nothing is still almost nothing.
  3. The statements made have to be in relation to the particular application. In that respect, if we want to focus on a particular feature, it may be worth doing a manual permutation importance ourselves and then reporting the overall loss reduction, as sketched after this list. (e.g. "While the original model $X$ using feature $Y$ has loss $Z$, the model trained with the permuted feature $Y^*$ has loss $10Z$", etc.)
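
A minimal sketch of such a single-feature report, in the evaluate-without-retraining variant and reusing the hypothetical `fit` and `dat` from the earlier example (the names and numbers are purely illustrative):

```r
# Loss of the original model on the intact data
orig_loss <- mean((dat$y - predict(fit, dat, n.trees = 200))^2)

# Loss after permuting only the feature of interest
perm_dat <- dat
perm_dat$variable1 <- sample(perm_dat$variable1)
perm_loss <- mean((dat$y - predict(fit, perm_dat, n.trees = 200))^2)

c(original = orig_loss, permuted = perm_loss, ratio = perm_loss / orig_loss)
```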