Solved – Interpretation of a gradient boost model

Tags: boosting, r

I recently fitted a gradient boosting model to predict a binary (Y/N) event.
I have a lot of features and a huge dataset.

After a cross-validated grid search, I managed to get a model that performs well enough, as confirmed on the verification dataset.

Now, my issue is that I struggle to interpret the result accurately, in plain English. The trees in the model are too numerous to go through one by one. Also, if someone could suggest a good visualisation, that would be nice.

Where I am:

I saw that the caret implementation has a function to visualise a tree. Really good, but still too messy.

The default graph that comes with the gbm implementation is really nice: it shows a histogram of the variables by importance. But it is still too univariate for my taste.

As I have a verification dataset, I profiled the Y cases against the N cases in my training dataset, and the true positive Y cases against the true negative N cases in the verification dataset. This gives good insight.

Ideas I would like input on:

  • Is there a way to simplify the set of trees into a kind of "summary" of the trees?

  • Is there an easy way to represent one variable against the result?

Best Answer

Is there a way to simplify the set of trees into a kind of "summary" of the trees?

Yes, the variable importance histogram is essentially doing this in a reasonably principled way.
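As a minimal sketch of what that summary looks like as numbers rather than a plot: calling summary() on a gbm object returns the relative influence of each variable, scaled by default to sum to 100. The data below are simulated purely for illustration; the variable names (x1, x2, x3) and tuning values are assumptions, not taken from the question.

```r
library(gbm)

set.seed(1)
n <- 1000

# Simulated data (hypothetical): x1 and x2 drive the outcome, x3 is pure noise
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(2 * d$x1 - d$x2))  # 0/1 response, as "bernoulli" expects

# Tuning values are arbitrary here; in practice they come from the grid search
fit <- gbm(y ~ x1 + x2 + x3, data = d, distribution = "bernoulli",
           n.trees = 200, interaction.depth = 2, shrinkage = 0.05)

# Relative influence per variable as a data frame, without drawing the bar chart
imp <- summary(fit, n.trees = 200, plotit = FALSE)
print(imp)
```

The table is sorted by influence, so reading it top to bottom gives a ranked one-line-per-variable summary of the whole ensemble. It says how much each variable is used, not in which direction it moves the prediction.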

Is there an easy way to represent one variable against the result?

Yes, they are called partial dependence plots. You can produce them with the plot function in R, applied to a gbm object. See this question for an example and discussion.
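A minimal sketch of such a plot, again on simulated data (the variable names and tuning values are assumptions for illustration only). Recent versions of gbm return a lattice object from plot(), so it is wrapped in print() to make sure it is drawn when run from a script.

```r
library(gbm)

set.seed(1)
n <- 1000

# Simulated data (hypothetical): the event probability rises with x1, falls with x2
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(2 * d$x1 - d$x2))

fit <- gbm(y ~ x1 + x2, data = d, distribution = "bernoulli",
           n.trees = 200, interaction.depth = 2, shrinkage = 0.05)

# Partial dependence of the predicted probability on x1,
# averaging over the other variables
print(plot(fit, i.var = "x1", n.trees = 200, type = "response"))

# The same curve as a data frame (columns: x1 and y), e.g. for custom plotting
pd <- plot(fit, i.var = "x1", n.trees = 200, type = "response",
           return.grid = TRUE)
head(pd)

# Passing two variables gives a joint partial dependence surface,
# which is one way to look beyond a purely univariate view
print(plot(fit, i.var = c("x1", "x2"), n.trees = 200, type = "response"))
```

With type = "response" the vertical axis is on the probability scale for a bernoulli model; the default, type = "link", shows the log-odds instead.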

Generally, the more complex and non-parametric your model, the more difficult it is going to be to convince yourself that you can "interpret" it. This is a fundamental tradeoff in modeling.
