Termination Condition – How to Define for Gradient Descent

Tags: algorithms, gradient descent, optimization

I wanted to ask how I can define the termination condition for gradient descent.

Can I stop it based on the number of iterations, i.e., take the parameter values after, say, 100 iterations?

Or should I wait until the difference between the 'new' and 'old' parameter values is very small, on the order of, let's say, $10^{-6}$? This will definitely take much more time.

What is the best way? In my case, even one iteration takes significant time, so if I wait for the second condition it might even take weeks, I guess.

So which approach should I use? How should I tackle this scenario?

Best Answer

Nice question. I've seen lots of stopping rules in the literature, and there are advantages and disadvantages to each, depending on context. The optim function in R, for example, has at least three different stopping rules (a toy sketch of all three follows the list):

  • maxit, i.e. a predetermined maximum number of iterations. A similar alternative I've seen in the literature is a maximum number of seconds before timing out. If all you need is an approximate solution, this can be a very reasonable choice. In fact, there are classes of models (especially linear models) for which early stopping is similar to putting a Gaussian prior on your parameter values. A frequentist would say you have an "L2 penalty" rather than a prior, but they would also think of it as a reasonable thing to do. I've only skimmed this paper, but it talks about the relationship between early stopping and regularization and might help point you toward more information. The short version is: yes, early stopping can be a perfectly respectable thing to do, depending on what you're interested in.

  • abstol, i.e., stop when the function gets "close enough" to zero. This may not be relevant for you (it doesn't sound like you're expecting a zero), so I'll skip it.

  • reltol, which is like your second suggestion: stop when the improvement drops below a threshold. I don't actually know how much theory there is on this, but you'll probably tend to reach lower minima this way than with a small maximum number of iterations. If that matters to you, it might be worth running the code for more iterations.
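To make the three rules concrete, here is a minimal Python sketch of a gradient descent loop that checks all three. The argument names maxit, abstol, and reltol are borrowed from optim, but everything else (the step size, the exact form of the reltol test, the toy quadratic) is just illustrative, not how optim actually works internally:

```python
import numpy as np

def gradient_descent(f, grad, x0, lr=0.1, maxit=1000,
                     abstol=1e-8, reltol=1e-6):
    """Minimize f from x0; stop on whichever rule triggers first."""
    x = np.asarray(x0, dtype=float)
    f_old = f(x)
    for it in range(1, maxit + 1):      # maxit: hard cap on iterations
        x = x - lr * grad(x)
        f_new = f(x)
        if abs(f_new) < abstol:         # abstol: value close enough to zero
            return x, it, "abstol"
        # reltol: improvement has become negligible relative to f's size
        if abs(f_old - f_new) < reltol * (abs(f_old) + reltol):
            return x, it, "reltol"
        f_old = f_new
    return x, maxit, "maxit"

# Toy quadratic: f(x) = ||x - 1||^2, minimized at x = 1.
f = lambda x: float(np.sum((x - 1.0) ** 2))
grad = lambda x: 2.0 * (x - 1.0)
x_opt, n_iter, rule = gradient_descent(f, grad, np.zeros(3))
print(rule, n_iter, x_opt)
```

Whichever rule fires first ends the run, so in practice you set all three and let the cheapest-to-satisfy one win.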

Another family of stopping rules has to do with optimizing a cost function on a validation data set (or with cross-validation) rather than on the training data. Depending on what you want to use your model for, you might want to stop well before you get to the local minimum on your training data, since that could involve overfitting. I'm pretty sure Trevor Hastie has written about good ways of doing this, but I can't remember the citation.
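As a rough illustration of that idea, here is a toy sketch of gradient descent that monitors a held-out validation loss and stops once it has plateaued. The data, the patience parameter, and the losses are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: noisy linear data, split into train and validation halves.
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.5 * rng.normal(size=200)
Xtr, ytr, Xva, yva = X[:100], y[:100], X[100:], y[100:]

grad_train = lambda w: 2.0 * Xtr.T @ (Xtr @ w - ytr) / len(ytr)
loss_val = lambda w: float(np.mean((Xva @ w - yva) ** 2))

def fit_with_validation_stopping(x0, lr=0.01, maxit=10_000, patience=10):
    """Gradient descent on the training loss, but stop once the
    validation loss has not improved for `patience` iterations."""
    x = np.asarray(x0, dtype=float)
    best_x, best_val, stale = x.copy(), loss_val(x), 0
    for _ in range(maxit):
        x = x - lr * grad_train(x)
        v = loss_val(x)
        if v < best_val:
            best_x, best_val, stale = x.copy(), v, 0
        else:
            stale += 1
            if stale >= patience:   # validation loss has plateaued
                break
    return best_x                   # best iterate seen, not the last one

w_hat = fit_with_validation_stopping(np.zeros(5))
print(loss_val(w_hat))
```

Note that it returns the best iterate seen rather than the final one, since the last few steps may already be past the validation minimum.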

Other options for finding lower minima in a reasonable amount of time include the following (a sketch combining the first and third appears after the list):

  • Stochastic gradient descent, which only requires estimating the gradients for a small portion of your data at a time (e.g. one data point for "pure" SGD, or small mini-batches).

  • More advanced optimization methods (e.g. Newton-type methods or conjugate gradient), which use information about the curvature of your objective function to help you point in better directions and take better step sizes as you move downhill.

  • A "momentum" term in your update rule, so that your optimizer does a better job of rolling downhill rather than bounding off canyon walls in your objective function.

These approaches are all discussed in these lecture notes I found online.

Hope this helps!

Edit: Oh, and you can also try to get better starting values (e.g. by solving a simpler version of the problem) so that it takes fewer iterations to get close to the optimum from your "warm start".
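For instance, here is a toy sketch of that warm-start idea on a made-up least-squares problem: a rough solution from a small subsample seeds the expensive full-data iterations. The problem, the subsample size, and the step counts are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem on a large data set.
n, p = 100_000, 20
A = rng.normal(size=(n, p))
b = A @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

def grad(x, idx):
    """Gradient of the mean squared error on the rows in `idx`."""
    Ai, bi = A[idx], b[idx]
    return 2.0 * Ai.T @ (Ai @ x - bi) / len(idx)

# Stage 1: cheap warm start from a 1% subsample (the "simpler problem").
sub = rng.choice(n, size=1_000, replace=False)
x = np.zeros(p)
for _ in range(200):
    x = x - 0.05 * grad(x, sub)

# Stage 2: a few expensive full-data iterations, starting from the
# warm start instead of from zero.
full = np.arange(n)
for _ in range(20):
    x = x - 0.05 * grad(x, full)
```

The subsample iterations are cheap, so you spend your expensive full-data iterations refining a point that is already close to the optimum.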