Conditions to apply Gradient Descent

convex optimization, gradient descent, machine learning, neural networks, optimization

I was reading "An Introduction to Optimization" by Edwin K. P. Chong. There is a chapter where they explain the basics of neural networks, and as an example they use gradient descent with the sigmoid function as the activation function for a neural network with 3 layers (input, hidden, and output).

Also, in all the material I found (the book included), the convergence theorems require the function being minimized to satisfy certain conditions, such as being convex, having a Lipschitz gradient, or being strongly convex.

My questions are: in practice, do we have to ensure that the function to minimize has those properties? In particular, if I use least squares with the sigmoid function (as in the book's example), will gradient descent still work, and why?

PS: Thanks in advance for your help, and I apologize for my English.

Best Answer

I think you might be mixing some things here. In general, gradient descent is probably the simplest optimization algorithm there is, and it requires very few assumptions. You do have to assume that the function is continuously differentiable. However, if you seek to ensure convergence to a global minimum or you wish to derive the complexity of the algorithm, things get a bit more complicated.
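
To make that concrete, here is a minimal sketch of plain gradient descent in Python (the test function, step size, and names are my own illustration, not from the book): all the method needs is a gradient oracle and a step size, and no convexity is assumed anywhere.

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Plain gradient descent: all it needs is a gradient oracle."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # (approximately) stationary point
            break
        x = x - step * g
    return x

# Example: f(x, y) = (x - 1)^2 + 2*(y + 3)^2, gradient = (2(x - 1), 4(y + 3))
grad_f = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 3)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))   # approaches (1, -3)
```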

If you make no assumptions on the function, then gradient descent can only be guaranteed to move towards a stationary point (a local min/max or a saddle point), if such a point exists. The other assumptions give you additional guarantees:

  • If we assume the function has a Lipschitz gradient - then we can derive an iteration complexity (a guaranteed amount of progress from one iteration to the next); see the sketch after this list.
  • If we assume the function is convex - then if it does stop, it stops at a minimum (for a convex function every local minimum is a global one); however, such a minimum need not exist.
  • If we assume the function has bounded level sets, then we are guaranteed it will converge to a stationary point. Bounded level sets can be obtained in various ways, most notably if the domain is compact or if the function is coercive (strong convexity, for instance, implies coercivity).
  • A strictly convex function has at most one minimizer (while a convex function may have a whole set of minimizers).
  • Strongly convex functions give an improved iteration complexity (a linear, i.e. geometric, rate of convergence).
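
To illustrate the Lipschitz-gradient point, here is a small sketch (the quadratic objective and constants are my own toy example): when the gradient is L-Lipschitz, the step size 1/L guarantees a decrease of at least ||∇f(x)||²/(2L) per iteration, which is exactly the inequality that the iteration-complexity results are built on.

```python
import numpy as np

# f(x) = 0.5 * x^T A x - b^T x has gradient A x - b, which is Lipschitz with
# constant L = largest eigenvalue of A. With step size 1/L, each iteration
# decreases f by at least ||grad||^2 / (2L) (the classical descent lemma).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
L = np.linalg.eigvalsh(A).max()          # Lipschitz constant of the gradient

x = np.zeros(2)
for _ in range(50):
    g = grad(x)
    assert f(x - g / L) <= f(x) - g @ g / (2 * L) + 1e-12   # descent lemma
    x = x - g / L

print(x, np.linalg.solve(A, b))          # both close to the unique minimizer
```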

Note that neural networks optimize cost functions that do not meet any of these assumptions; moreover, they may not even be continuously differentiable. So there are no theoretical guarantees for neural-network training, but in practice it converges to a local minimum or a saddle point, at an unknown rate. In practice one also uses a variant of gradient descent called stochastic gradient descent (SGD), which processes the data in mini-batches at every iteration, since computing and storing the gradient over the full data set can be too expensive in memory and computation.
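
To connect this back to the question, here is a hedged sketch of mini-batch SGD on a least-squares loss with a sigmoid, the kind of non-convex objective the book's example produces (the toy data, learning rate, and names are my own, not from the book). In practice it usually converges to a useful stationary point even though none of the convexity assumptions hold.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: least-squares fit of a single sigmoid unit, y ≈ sigmoid(X w).
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = sigmoid(X @ w_true) + 0.05 * rng.normal(size=1000)

def batch_grad(w, Xb, yb):
    """Gradient of 0.5 * mean((sigmoid(Xb w) - yb)^2) with respect to w."""
    p = sigmoid(Xb @ w)
    return Xb.T @ ((p - yb) * p * (1 - p)) / len(yb)

w, lr, batch = np.zeros(3), 0.5, 32
for epoch in range(200):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch):      # one mini-batch per step
        sel = idx[start:start + batch]
        w -= lr * batch_grad(w, X[sel], y[sel])

print(w)   # typically lands near w_true despite the non-convex objective
```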
