I came across one comment in an xgboost tutorial. It says "Remember that gamma brings improvement when you want to use shallow (low max_depth) trees".
My understanding is that higher gamma means higher regularization. If we have deep (high max_depth) trees, there will be a greater tendency to overfit. Why is it, then, that gamma improves performance when using shallow trees?
Here is the tutorial link:
https://www.hackerearth.com/fr/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/
Best Answer
As you correctly note, `gamma` is a regularisation parameter. In contrast with `min_child_weight` and `max_depth`, which regularise using "within tree" information, `gamma` works by regularising using "across trees" information: by observing the typical size of the loss changes, we can adjust `gamma` so that our trees add nodes only if the associated gain is larger than or equal to $\gamma$. In the rather famous 2014 XGBoost presentation by Chen, p. 33 refers to $\gamma$ as the "complexity cost by introducing additional leaf".

Now, a typical situation where we would tune `gamma` is when we use shallow trees to combat over-fitting. The obvious way to combat over-fitting is to use shallower trees (i.e. lower `max_depth`), and therefore the context where tuning `gamma` becomes relevant is "when you want to use shallow (low max_depth) trees". Admittedly this is a bit of a tautology, but realistically, if we expect deeper trees to be beneficial, tuning `gamma`, while still effective as regularisation, will also unnecessarily burden our learning procedure. On the other hand, if we wrongly use deeper trees, then unless we regularise very aggressively we might accidentally end up in a local minimum that $\gamma$ cannot save us from. Therefore, $\gamma$ is indeed more relevant for "shallow-tree situations". :) A great blog post on tuning $\gamma$ can be found here: xgboost: “Hi I’m Gamma. What can I do for you?” - and the tuning of regularization.

A final word of caution: notice that $\gamma$ is strongly dependent on the actual estimated parameters and the (training) data. That is because the scale of our response variable effectively dictates the scale of our loss function, and hence which reductions in the loss function (i.e. values of $\gamma$) we consider meaningful.