Solved – Are line search methods used in deep learning? If not, why not?

deep learning, machine learning, neural networks, optimization

A lot of online tutorials talk about gradient descent, and almost all of them use a fixed step size (learning rate $\alpha$). Why do they make no use of line search (such as backtracking line search or exact line search)?

Best Answer

Vanilla gradient descent can be made more reliable using line searches; I've written algorithms that do this, and it makes for a very stable method (although not necessarily a fast one).
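To make the full-batch case concrete, here is a minimal sketch of gradient descent with an Armijo-style backtracking line search. It is an illustration, not the answerer's actual code: the function `backtracking_gd` and the constants `alpha0`, `beta`, and `c` are assumptions chosen for demonstration.

```python
import numpy as np

def backtracking_gd(loss, grad, w0, alpha0=1.0, beta=0.5, c=1e-4,
                    tol=1e-6, max_iter=1000):
    """Full-batch gradient descent with Armijo backtracking line search (illustrative)."""
    w = w0.copy()
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:
            break
        alpha = alpha0
        f_w = loss(w)
        # Shrink the step until the sufficient-decrease (Armijo) condition holds:
        # f(w - alpha g) <= f(w) - c * alpha * ||g||^2
        while loss(w - alpha * g) > f_w - c * alpha * np.dot(g, g):
            alpha *= beta
        w = w - alpha * g
    return w

# Hypothetical usage: minimize a simple quadratic f(w) = ||w - 1||^2.
loss = lambda w: np.sum((w - 1.0) ** 2)
grad = lambda w: 2.0 * (w - 1.0)
w_star = backtracking_gd(loss, grad, np.zeros(3))
```

Note that each backtracking trial requires evaluating `loss` on the full dataset, which is cheap here but is the crux of the problem in the stochastic setting discussed next.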

However, it makes almost no sense to do a line search for stochastic gradient methods. The reason I say this is that if we do a line search based on minimizing the full loss function, we've immediately lost one of the main motivations for doing stochastic methods; we now need to compute the full loss function for each update, which typically has a computational cost comparable to computing the full first derivative. Given that we wanted to avoid computing the full gradient because of computational costs, it seems very unlikely that we will be okay with computing the full loss function.
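A hypothetical sketch of what this would look like (names and constants are again assumptions, not anything from the original answer); the point is how many full-dataset evaluations a single stochastic update triggers:

```python
import numpy as np

def sgd_with_full_loss_line_search(loss_full, grad_sample, w0, data,
                                   n_epochs=1, alpha0=1.0, beta=0.5, c=1e-4):
    """SGD where each step is backtracked against the FULL loss (illustrative).

    Every backtracking trial calls loss_full over the whole dataset, which costs
    roughly as much as a full gradient -- exactly the expense SGD is meant to avoid.
    """
    w = w0.copy()
    for _ in range(n_epochs):
        for x, y in data:
            g = grad_sample(w, x, y)        # cheap: one sample
            alpha = alpha0
            f_w = loss_full(w)              # expensive: whole dataset
            while loss_full(w - alpha * g) > f_w - c * alpha * np.dot(g, g):
                alpha *= beta               # every trial re-scans the dataset
            w = w - alpha * g
    return w
```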

Alternatively, you might think of doing something like a line search based on your randomly sampled data point. However, this isn't a good idea either; it will tell you nothing about whether you have stepped too far (which is the main benefit of line searches). For example, suppose you are performing logistic regression. Then each outcome is simply a 0 or 1, and for any single sample we trivially get perfect separation, so the optimal solution for our regression parameters based on a sample of size 1 is $-\infty$ or $\infty$ by the Hauck-Donner effect. That's not good.
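A tiny numerical illustration of this separation problem (the specific numbers are just for demonstration): with a single sample, the per-sample logistic loss decreases monotonically as the weight grows, so a line search against that one sample would happily accept arbitrarily large steps.

```python
import numpy as np

# One logistic-regression sample: feature x = 1.0, label y = 1.
x, y = 1.0, 1.0
nll = lambda w: np.log1p(np.exp(-y * w * x))  # per-sample negative log-likelihood

# The single-sample loss keeps shrinking as w grows, so an "exact line search"
# on this sample alone would push w toward +infinity.
for w in [0.0, 1.0, 10.0, 100.0]:
    print(w, nll(w))
# 0.0    ~0.693
# 1.0    ~0.313
# 10.0   ~4.5e-05
# 100.0  ~0.0
```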

EDIT

@DeltaIV points out that this also applies to mini-batches, not just individual samples.