Solved – Which parameter of GBM does gradient descent update after calculating the gradient of the loss function?

boosting, cart, gradient descent, machine learning, optimization

I am going through The Elements of Statistical Learning and trying to understand the GBM (gradient boosting machine) algorithm.

The GBM algorithm is shown below.

[image: Algorithm 10.3 (Gradient Tree Boosting) from ESL]
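For reference, the algorithm reads (notation as in ESL; $L$ is the loss function, $N$ the number of training points, $M$ the number of trees):

1. Initialize $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$.
2. For $m = 1$ to $M$:
   (a) Compute the pseudo-residuals $r_{im} = -\left[\partial L(y_i, f(x_i)) / \partial f(x_i)\right]_{f = f_{m-1}}$ for $i = 1, \dots, N$.
   (b) Fit a regression tree to the targets $r_{im}$, giving terminal regions $R_{jm}$, $j = 1, \dots, J_m$.
   (c) For each region, compute $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)$.
   (d) Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} \, I(x \in R_{jm})$.
3. Output $\hat{f}(x) = f_M(x)$.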

I understand the general gradient descent algorithm, shown below, very well.

[image: generic gradient descent update for parameters $\theta_j$]
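That is, the standard parameter-space update with step size $\rho$:

$$\theta_j \leftarrow \theta_j - \rho \, \frac{\partial L(\theta)}{\partial \theta_j}$$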

Questions

  1. Which parameter (theta j in the picture above) of GBM is gradient descent updating with each new tree that is added? Can you explain the GBM algorithm intuitively in this context?
  2. What is the gamma in the GBM algorithm, and what is the intuition behind it?
  3. It seems gamma is calculated for each terminal region of each tree. What does it mean/do?
  4. GBM does not reweight the training samples, unlike AdaBoost, which does. True or false?

Best Answer

  1. GBM performs gradient descent in function space rather than in parameter space: the "parameters" being updated are the current predictions $f(x_i)$ at the training points. At each iteration, the negative gradient of the loss with respect to those predictions gives the pseudo-residuals, and the next tree is fit to them. The residuals can be thought of as the step direction (see the sketch after this list).
  2. Gamma is the result of a line search along that step direction: within each terminal node, it is the constant that minimizes the loss, and it becomes that node's prediction for the iteration. It can be thought of as the step size.
  3. You are correct that gamma is computed for every terminal region. This is only because the base learner in GBM is a decision tree: a tree partitions the input space into disjoint regions, so a separate step size can be chosen for each region instead of a single one for the whole step.
  4. True, although GBM supports the exponential loss function, which Friedman, Hastie, and Tibshirani showed makes boosting equivalent to AdaBoost.
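To make points 1-3 concrete, here is a minimal from-scratch sketch of gradient tree boosting for squared-error loss. The toy data, tree depth, and learning rate are arbitrary illustrative choices, and scikit-learn's DecisionTreeRegressor stands in for the base learner:

```python
# Gradient tree boosting sketch for squared-error loss L = (y - F)^2 / 2.
# The "parameter" updated by gradient descent is the prediction vector F,
# the pseudo-residuals are the step direction, and gamma is a per-leaf
# step size.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_trees, learning_rate = 50, 0.1
F = np.full_like(y, y.mean())   # step 1: f_0 = argmin_gamma sum_i L(y_i, gamma)

trees = []
for m in range(n_trees):
    # step 2a: negative gradient of L w.r.t. F -> pseudo-residuals (direction)
    residuals = y - F
    # step 2b: fit a tree to the direction; its leaves define regions R_jm
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    # step 2c: one gamma per terminal region = argmin of the loss in that leaf
    # (for squared error this is just the mean residual in the leaf, which is
    # also what the tree stores as its leaf value, so no extra line search)
    leaf = tree.apply(X)
    gamma = {j: residuals[leaf == j].mean() for j in np.unique(leaf)}
    # step 2d: take the step: F <- F + learning_rate * gamma_{leaf(x)}
    F += learning_rate * np.array([gamma[j] for j in leaf])
    trees.append((tree, gamma))

print("training MSE:", np.mean((y - F) ** 2))
```

For other losses (absolute error, exponential, deviance) only two pieces change: the formula for the pseudo-residuals and the per-leaf minimization for gamma; for squared error the two happen to coincide with what the regression tree already computes.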