Solved – Why would a slow learning rate with more iterations be better than a large learning rate with fewer iterations?

boosting, machine learning, optimization

Why would a slow learning rate with more iterations be better than a large learning rate with fewer iterations?

I think this statement is generally true for neural networks and gradient boosting models.

Here is one example demonstrating the problem. In the demo, we try to approximate a 2D quadratic function; the ground truth is shown in the left figure and the approximation in the right figure. A gradient boosting model is used, where $f(x)=\sum_{i=1}^{n} \alpha\cdot b_i(x)$. Here $n$ is the number of iterations, $b_i$ is the base learner added at iteration $i$, and $\alpha$ is the learning rate.

  1. Learning rate $0.2$, $50$ iterations.
     [figure: approximation with learning rate 0.2, 50 iterations]

  2. Learning rate $0.02$, $500$ iterations.
     [figure: approximation with learning rate 0.02, 500 iterations]
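A comparison along these lines can be reproduced with a short sketch. This is only an illustration under assumptions: scikit-learn's `GradientBoostingRegressor` stands in for the boosting model, and the quadratic target and sample grid are made up here, since the original data is not given.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Assumed stand-in for the demo: a 2D quadratic target sampled at random points.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = X[:, 0] ** 2 + X[:, 1] ** 2          # ground-truth quadratic surface

# Same total shrinkage budget (learning_rate * n_estimators = 10) in both runs.
fast = GradientBoostingRegressor(learning_rate=0.2, n_estimators=50).fit(X, y)
slow = GradientBoostingRegressor(learning_rate=0.02, n_estimators=500).fit(X, y)

print("lr=0.2,  n=50:  MSE =", mean_squared_error(y, fast.predict(X)))
print("lr=0.02, n=500: MSE =", mean_squared_error(y, slow.predict(X)))
```

Both settings spend the same total shrinkage, so any difference in fit quality comes from the step size alone.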

Intuitively I can understand that a small learning rate means "fine tuning" rather than "rough tuning", but is there a more formal explanation?

A related post:

Boosting: why is the learning rate called a regularization parameter?

Best Answer

Gradient descent uses the gradient of the cost function, evaluated at the current set of coefficients, to choose the next step that best decreases the cost function.

I'm assuming we're only using the first derivative to find the gradient, with learning rate $\alpha$. Let's take the cost function $f(b) = b^2$, whose gradient is $f'(b) = 2b$. If the current value of $b$ is $-20$, then the gradient at this point is $-40$ and the negative gradient is $40$. So we add $\alpha \times 40$ to the current value of $b$. If we use $\alpha = 1$, our next value of $b$ is $20$, which has "overshot" the optimum at $b = 0$. Continuing with $\alpha = 1$, the gradient at $b = 20$ is $40$, so the negative gradient of $-40$ takes us straight back to $b = -20$; the algorithm oscillates forever and never converges to the global optimum.

If we use a smaller value of $\alpha = 0.1$, then the next value of $b$ is $-20 + 0.1 \times 40 = -16$, which moves toward the optimum, and the iteration will converge.

If we use a slightly larger value of $\alpha = 0.2$, then the next value of $b$ is $-20 + 8 = -12$, which is closer to the optimum, but the larger step size risks overshooting the minimum. However, as we approach the minimum the gradient shrinks as well: at $b = -12$ the gradient is $-24$, which gives a step of $0.2 \times 24 = 4.8$.

This also shows that a smaller value of $\alpha$ requires many more iterations to converge than a larger value of $\alpha$.
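The worked example above can be checked numerically. A minimal sketch, assuming the cost function $f(b) = b^2$ with gradient $f'(b) = 2b$ as in the text:

```python
def gradient_descent(alpha, b0=-20.0, steps=50):
    """Minimize f(b) = b^2 via gradient descent; f'(b) = 2b."""
    b = b0
    history = [b]
    for _ in range(steps):
        b = b - alpha * 2 * b   # update: b <- b - alpha * f'(b)
        history.append(b)
    return history

print(gradient_descent(1.0, steps=4))    # oscillates: -20, 20, -20, 20, -20
print(gradient_descent(0.1, steps=3))    # converging: -20, -16, -12.8, -10.24
```

With $\alpha = 1$ each update flips the sign of $b$ ($b \leftarrow b(1 - 2\alpha) = -b$), so the iterates bounce between $-20$ and $20$ forever; with $\alpha = 0.1$ each step shrinks $b$ by a factor of $0.8$, so convergence is geometric but takes many iterations.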

[figure: gradient descent steps on the cost curve for different learning rates]

Hence, a smaller $\alpha$ (learning rate) results in a smaller step size, over which the local gradient remains a good approximation of the cost surface, which in turn improves the ability to locate the optimal point.

Moreover, as the algorithm approaches the true optimum, the gradient shrinks, which automatically shrinks the step size and keeps the local linear approximation accurate. A large learning rate can defeat this self-correction: the step implied by the gradient overshoots the minimum, and the algorithm "oscillates" around the true optimum.

In practice, truncation and round-off of small floating-point numbers should also be taken into account, so that the resulting step size does not become too small for floating-point arithmetic to register any change.
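The floating-point caveat can be made concrete: once the update $\alpha \cdot f'(b)$ falls below half a unit in the last place (ulp) of $b$, the subtraction rounds the step away entirely and the descent stalls. A small illustration:

```python
import math

b = 1.0
tiny_step = 1e-17             # smaller than ulp(1.0)/2 ~ 1.1e-16
print(b - tiny_step == b)     # True: the step is rounded away, b is unchanged

visible_step = 1e-15          # larger than ulp(1.0)
print(b - visible_step == b)  # False: this step still registers

print(math.ulp(1.0))          # spacing of doubles around 1.0
```

So below a certain step size, further iterations make no progress at all, regardless of how many are run.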