Solved – gamma parameter in xgboost

boosting, parameterization

I came across a comment in an xgboost tutorial. It says "Remember that gamma brings improvement when you want to use shallow (low max_depth) trees".

My understanding is that a higher gamma means stronger regularization. If we have deep (high max_depth) trees, there will be a greater tendency to overfit. Why is it the case that gamma can improve performance when using shallow trees?

Here is the tutorial link
https://www.hackerearth.com/fr/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/

Best Answer

As you correctly note, gamma is a regularisation parameter. In contrast with min_child_weight and max_depth, which regularise using "within tree" information, gamma works by regularising using "across trees" information. In particular, by observing the typical size of the loss changes we can set gamma so that a tree adds a node only if the associated gain is greater than or equal to $\gamma$. In the rather famous 2014 XGBoost presentation by Chen, p. 33 refers to $\gamma$ as the "complexity cost by introducing additional leaf".
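To make this concrete, the structure-score gain that XGBoost evaluates for a candidate split (as in Chen's presentation and the later XGBoost paper) is

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma,$$

where $G_{L}, G_{R}$ and $H_{L}, H_{R}$ are the sums of first- and second-order gradients of the loss over the instances falling into the left and right child, and $\lambda$ is the L2 regularisation term. The split is kept only if this quantity is positive, i.e. only if the loss reduction exceeds the complexity cost $\gamma$ of the extra leaf.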

Now, a typical situation where we would tune gamma is when we use shallow trees to combat over-fitting. The obvious way to combat overfitting is to use shallower trees (i.e. lower max_depth), and therefore the context where tuning gamma becomes relevant is "when you want to use shallow (low max_depth) trees". It is a bit of a tautology, but realistically, if we expect deeper trees to be beneficial, then tuning gamma, while still effective as regularisation, will also unnecessarily burden our learning procedure. On the other hand, if we wrongly use deeper trees, then unless we regularise very aggressively we might accidentally end up in a local minimum that $\gamma$ cannot save us from. Therefore, $\gamma$ is indeed more relevant for "shallow-tree situations". :) A great blog post on tuning $\gamma$ can be found here: xgboost: “Hi I’m Gamma. What can I do for you?” - and the tuning of regularization.
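As a minimal sketch of what such tuning can look like (this is my illustration, not the tutorial's code; the dataset and the particular gamma values are arbitrary assumptions), one can fix a shallow max_depth and scan gamma with cross-validation:

```python
# Sketch: scan gamma for a shallow-tree model with xgboost.cv.
# Dataset and parameter values are purely illustrative.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

for gamma in [0, 0.1, 1, 5, 10]:
    params = {
        "objective": "binary:logistic",
        "max_depth": 3,          # shallow trees: the regime where gamma tuning matters most
        "eta": 0.1,
        "gamma": gamma,          # minimum loss reduction required to make a further split
        "eval_metric": "logloss",
    }
    cv = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
                early_stopping_rounds=20, seed=0, verbose_eval=False)
    print(f"gamma={gamma}: best CV logloss = {cv['test-logloss-mean'].min():.4f}")
```

The gamma with the lowest cross-validated loss would then be kept; with a deep max_depth instead, the grid above would mostly just slow the search down, which is the point the quoted comment is making.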

Final word of caution: do notice that $\gamma$ is strongly dependent on the actual estimated parameters and the (training) data. That is because the scale of our response variable effectively dictates the scale of our loss function, and hence which reductions in the loss function (i.e. which values of $\gamma$) we consider meaningful.
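A small sketch of this scale dependence (again my own illustration, with arbitrary data and an arbitrary gamma of 5): rescaling a regression target rescales the loss reductions, so the very same gamma prunes much more on one scale than on the other.

```python
# Sketch: the same gamma behaves very differently when the target is rescaled,
# because the per-split loss reductions scale with the target.
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

for scale in [1.0, 1000.0]:
    model = xgb.XGBRegressor(max_depth=3, n_estimators=50, gamma=5.0,
                             learning_rate=0.1, random_state=0)
    model.fit(X, y * scale)
    # Count how many splits survived the gamma threshold across all trees
    n_splits = sum(tree.count("yes=") for tree in model.get_booster().get_dump())
    print(f"target scale x{scale:g}: {n_splits} splits kept with gamma=5")
```

In other words, a gamma value tuned on one problem (or one scaling of the response) does not transfer to another; it has to be tuned relative to the loss scale at hand.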