First please check Matthew Drury's answer here for a similar question.
In general, model complexity can be defined as a function of the number of free parameters: the more free parameters a model has, the more complex it is. Moreover, if a model has many parameters but we place many restrictions on them, so they are not that "free", we also get a "simpler" model. That is the key idea behind regularization.
There are different ways of measuring model complexity; information criteria such as AIC and BIC, which penalize the number of parameters, are widely used.
In the link you provided, the notation is not standardized or widely used in statistics / machine learning; it reflects the author's own conventions. As mentioned in the document, $T$ is the number of leaves in a boosted tree and $w_i$ is the score of each leaf. So intuitively, the more leaves we have, the more free parameters we have, and the larger the weights are, the more complex the model is (similar to ridge regression).
To conclude, as mentioned in the original link, this is one way of defining the complexity (not a "standard" way): it penalizes the number of leaves and the weights of the leaves via the squared L2 norm.
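Concretely, the complexity term defined in the linked document can be written as

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{i=1}^{T} w_i^2$$

where $T$ is the number of leaves and $w_i$ the score of leaf $i$: $\gamma$ prices each additional leaf, and $\lambda$ shrinks the leaf scores, exactly the two penalties mentioned above.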
$\gamma$ is a threshold on the gain: if the gain from a split is smaller than $\gamma$, we do better not to add that branch.
$\lambda$ is a regularization parameter.
The larger $\gamma$ and $\lambda$ are, the more the model is regularized / the simpler it is.
Check this link and search for gamma and lambda:
gamma [default=0]
minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be.
lambda [default=1]
L2 regularization term on weights, increase this value will make model more conservative.
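As a toy illustration (not XGBoost's actual implementation), here is how $\lambda$ and $\gamma$ enter the leaf-weight and split-gain formulas from the linked document. The `G` and `H` arguments stand for the sums of first- and second-order gradients over the instances in a node, and the example values are made up:

```python
def leaf_weight(G, H, lam):
    # Optimal leaf score w* = -G / (H + lambda):
    # a larger lambda shrinks the weight toward zero.
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    # Improvement in the structure score from splitting a node into
    # left/right children; gamma is subtracted as the cost of the extra leaf.
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# With gamma = 0 this candidate split has positive gain, so it is kept...
print(split_gain(4.0, 2.0, -3.0, 2.0, lam=1.0, gamma=0.0))
# ...but a large enough gamma makes the gain negative, so the branch is pruned.
print(split_gain(4.0, 2.0, -3.0, 2.0, lam=1.0, gamma=10.0))
```

This makes the "threshold on the gain" reading of $\gamma$ literal: a split is only added when its gain exceeds $\gamma$.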
The contradiction you noticed is precisely the idea of the regularization: you want to exchange confidence over the training set for confidence over the test set, since being too confident over the training set does not imply the model will generalize well. Indeed, you could be just fitting training set noise. Then, when you penalize weights, you usually end up with a simpler model (imagine that some weights may become zero) that may perform worse on the training set, but performs better over unseen data.
Best Answer
This makes sense to me.
I'll focus on the Gaussian case. Here each tree $T_i$ is fit on the residuals of the current model, and the model update is $M_{i+1} = M_{i} + \alpha T_i$. The idea of a gradient booster is to carefully and slowly reduce the bias of the model by adding these trees one by one.
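A minimal, dependency-free sketch of this residual-fitting loop under squared-error loss; the depth-1 stumps, the toy data, and the learning rate are all made up for illustration:

```python
def fit_stump(x, r):
    # Fit a depth-1 regression stump to residuals r: pick the threshold
    # minimizing squared error, predicting the mean on each side.
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - ml) ** 2 for ri in left)
               + sum((ri - mr) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.2, 3.1, 3.0]
alpha = 0.3                       # learning rate
pred = [0.0] * len(y)             # M_0

for _ in range(100):
    resid = [yi - pi for yi, pi in zip(y, pred)]
    tree = fit_stump(x, resid)                         # T_i fit on residuals
    pred = [pi + alpha * tree(xi)                      # M_{i+1} = M_i + alpha * T_i
            for xi, pi in zip(x, pred)]
```

Each tree only nudges the current model toward the targets, so the bias is reduced slowly over many updates rather than in one large step.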
In this case, a large value of $w_i$ would correspond to a terminal (leaf) node giving a very large and significant update to the prior model. The idea of the regularization term is to minimize these incidents of large single tree updates (only allowing them if the decrease in the model loss function is large enough to offset the regularization penalty). If such an update is regularized away for a single tree, but turns out to be justified, it will be baked in over multiple model updates, in accordance with the philosophy of boosting.
This is in very close analogy to ridge regression.