Optimization – Definition of a First Order Method

Tags: non-convex-optimization, nonlinear-optimization, numerical-optimization, optimization

The term "first order method" is often used to categorize a numerical optimization method (for, say, constrained minimization), and similarly for "second order method".

I wonder what is the exact definition of a first (or second) order method.

Does this term relate to the convergence rate of the method, or to the fact that one explicitly uses first (or second) derivatives of the function to be minimized?

More particularly, Newton's method is considered a second order method: it uses second order derivatives (the Hessian) and has a quadratic convergence rate. But what if, for example, the energy is not convex? In that case the Hessian has to be modified to be (symmetric) positive definite. Second order derivatives are still used, but it is no longer clear what the convergence rate is; it can be linear, depending on the actual modification of the Hessian.
So is that a second order or a first order method?

Best Answer

Difference between 1st and 2nd order algorithms:

Any algorithm that requires at least one first-derivative/gradient is a first order algorithm. In the case of a finite sum optimization problem, you may use only the gradient of a single sample, but this is still first order because you need at least one gradient.
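As a minimal sketch of this point (with an assumed least-squares finite-sum objective and an assumed step size, purely for illustration), both a full-batch gradient step and a single-sample stochastic step use only first derivatives, so both are first order:

```python
import numpy as np

# Assumed toy finite-sum problem: f(x) = (1/n) * sum_i (a_i . x - b_i)^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))   # rows a_i
b = rng.standard_normal(100)        # targets b_i

def grad_full(x):
    """Full gradient: average of per-sample gradients (first order)."""
    return 2.0 / len(b) * A.T @ (A @ x - b)

def grad_sample(x, i):
    """Gradient of a single sample i -- still first order information."""
    return 2.0 * (A[i] @ x - b[i]) * A[i]

x = np.zeros(5)
step = 0.01                          # assumed step size
for k in range(1000):
    # Either a full-batch step ...
    #   x = x - step * grad_full(x)
    # ... or a stochastic single-sample step; both use only gradients.
    i = rng.integers(len(b))
    x = x - step * grad_sample(x, i)
```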

A second order algorithm is any algorithm that uses some second derivative (in the scalar case, simply $f''(x)$). To elaborate, Newton's step requires the Hessian, whose entries are second order derivatives, i.e. $H_{i,j} = \partial_i \partial_j f(x)$.
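For contrast with the first order sketch above, here is a minimal Newton iteration on an assumed smooth test function (the function, starting point, and iteration count are illustrative assumptions); the step solves a linear system with the full Hessian of second derivatives:

```python
import numpy as np

# Assumed test function: f(x) = 0.5*||x||^2 + sum_i log(1 + exp(x_i)).
def grad(x):
    s = 1.0 / (1.0 + np.exp(-x))                     # sigmoid(x_i)
    return x + s                                     # first derivatives

def hess(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return np.eye(len(x)) + np.diag(s * (1.0 - s))   # second derivatives

x = np.ones(3)                                       # assumed starting point
for k in range(10):
    # Newton step: solve H d = -grad(x), then update (second order).
    d = np.linalg.solve(hess(x), -grad(x))
    x = x + d
```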

In general, an algorithm can be classified as an $n^{\text{th}}$ order method if it uses the rank-$n$ tensor of $n^{\text{th}}$ partial derivatives.

Important distinction:

When an algorithm uses an approximated version of the second order information (the Hessian), it may or may not be a second order algorithm. I am emphasizing this because, for example, you may approximate the Hessian by the outer product of the gradient, $H \approx \nabla f(x) \nabla f(x)^T$. In this case you are only using first order information, so this would be considered a first order algorithm. On the other hand, if you approximate the Hessian by computing only its diagonal elements $\partial_i \partial_i f(x)$, that is a second order algorithm.
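A small sketch of the two approximations just mentioned, on an assumed toy objective chosen so the derivatives are easy to read off: the gradient outer product touches only first derivatives, while the diagonal approximation genuinely evaluates second derivatives.

```python
import numpy as np

# Assumed toy objective for illustration: f(x) = sum(x**4) / 4.
def grad(x):
    return x ** 3                    # first derivatives

def hess_diag(x):
    return 3.0 * x ** 2              # exact diagonal second derivatives

x = np.array([1.0, -2.0, 0.5])

g = grad(x)
H_outer = np.outer(g, g)             # first order info only: gradient outer product
H_diag  = np.diag(hess_diag(x))      # genuine second order info: diag of d_ii f

print(H_outer)
print(H_diag)
```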

The order of an algorithm can also be thought of as the order of the error resulting from approximating the derivatives. Recall the (forward-difference) approximation of a first derivative: $$\partial_i f(x) \approx \frac{f(x + \epsilon \Delta x_i) - f(x)}{\epsilon}$$

and the second derivative: $$\partial_{ij} f(x) \approx \frac{\partial_j f(x + \epsilon \Delta x_i) - \partial_j f(x)}{\epsilon},$$

where $\Delta x_i$ is the vector that is $1$ in the $i^{\text{th}}$ coordinate and $0$ elsewhere. Hence if you use an $\epsilon$-accurate approximation for each derivative computation, nesting them to build second derivatives compounds the error (on the order of $\epsilon^2$) inherently in your optimization algorithm.
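A sketch of this nesting, on an assumed 1-D test function with known derivatives (the function and step $\epsilon$ are assumptions): the second derivative is built by differencing an already-approximated first derivative, so the two truncation errors compound.

```python
import numpy as np

# Assumed test function: f(x) = exp(x), so f'(x) = f''(x) = exp(x).
f = np.exp
x0, eps = 1.0, 1e-4                  # assumed evaluation point and step

def d1(f, x, eps):
    """Forward-difference first derivative, O(eps) truncation error."""
    return (f(x + eps) - f(x)) / eps

def d2(f, x, eps):
    """Second derivative as a finite difference of the *approximate*
    first derivative -- the errors of the two levels compound."""
    return (d1(f, x + eps, eps) - d1(f, x, eps)) / eps

print(abs(d1(f, x0, eps) - np.exp(x0)))   # error of the first derivative
print(abs(d2(f, x0, eps) - np.exp(x0)))   # larger error from the nesting
```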

Since your question touches on it, I would like to add that first order algorithms do not necessarily converge linearly: Stochastic Gradient Descent (SGD) converges sub-linearly, and many second order algorithms do not converge quadratically.