For me, the best way to intuitively understand this definition is via directional derivatives. A well known result is that for a continuously differentiable function, the directional derivative of $f$ in the direction $d$, which measures its slope along the ray pointing from some point in direction $d$, is:
$$
f'(x; d) = \nabla f(x)^T d.
$$
The vectors $x - x^*$ for all $x \in C$ point from the point $x^*$ towards all the points of $C$, and the stationarity conditions
$$
\forall x \in C \quad \nabla f(x)^T (x - x^*) \geq 0
$$
means that the function's slope along any direction which points from $x^*$ toward a point in $C$ is non-negative. That is, moving a small distance from $x^*$ toward any point in $C$ does not decrease the function values. In other words, at $x^*$ there are no feasible descent directions.
It is also easy to prove that when $x^*$ is an interior point of $C$, then this condition reduces to the well known condition of $\nabla f(x) = 0$, since for very small vectors $d$ both $x^* + d$ and $x^* - d$ are in $C$, and we get:
$$
\nabla f(x^*)^T d \geq 0, \quad \nabla f(x^*) (-d) \geq 0.
$$
Sure. There are a number of methods, but the most common is called Halley's method. I prefer the paper "On large-scale unconstrained optimization problems and higher order methods" by Gundersen and Steihaug.
That said, there are some good reasons to do this and some good reasons not to do this. The snarky answer as to why not is that Newton's method converges quadratically in a neighborhood of the solution. A third order method can converge cubically, but who cares if we can just do another quadratic iteration? That answer disregards that a second-order method does more than just converge quadratically. It takes care of affine scalings even away from the solution and that's valuable. A third-order method can represent even more complicated scalings and that may be valuable.
Really, the reason not to use one is pragmatic and deals with linear system solvers. Even though we call Newton's method a second-order method, it's really a first derivative trick. Really, what we're doing is writing the first order optimality conditions as a nonlinear system $F(x)=\nabla f(x) = 0$. Then, Newton's method is simply a first order Taylor series solution to this problem:
$F^\prime(x)\delta x + F(x) = 0$
becomes
$\nabla^2 f(x)\delta x + \nabla f(x) = 0$
Alright, so we could do an additional term in the Taylor series. Recall,
$$
F(x+\delta x) \approx F(x) + F^\prime(x)\delta x + \frac{1}{2}F^{\prime\prime}(x)(\delta x)(\delta x)
$$
Here's where we run into trouble. $F^\prime(x)=\nabla^2 f(x)$ is a linear system and we know how to solve linear systems on a computer. We can factorize it. We can apply a Krylov method to it. We can do all sorts of things to it. That said, $F^{\prime\prime}(x)\in L(X,L(X))$. Technically, it's a linear operator, but it returns another linear operator, which we can then solve. As such, we don't have a straightforward answer on how to set the Taylor series above to zero and solve. What Halley's method does is cheat. It really solves for the Newton step $\delta x_{\textrm{Newton}}=-F^{\prime}(x)^{-1}F(x)$ and then plugs this into one of the directions for the last derivative. Hence, we have that
$$
F(x+\delta x) \approx F(x) + F^\prime(x)\delta x + \frac{1}{2}F^{\prime\prime}(x)(\delta x_{\textrm{Newton}})(\delta x)
$$
Then, we can solve
$$
(F^\prime(x) + \frac{1}{2}F^{\prime\prime}(x)\delta x_{\textrm{Newton}})\delta x = -F(x)
$$
or
$$
(\nabla^2 f(x) + \frac{1}{2}\nabla^3 f(x)\delta x_{\textrm{Newton}})\delta x = -\nabla f(x)
$$
Alright, a pain, but somewhat workable. Of course, now we have other problems. Someone has to sit down and derive out the derivative, which can be prohibitively difficult. This can be alleviated by using automatic differentiation (AD.) However, if we want to use AD, we have to be smart about things. Using a forward-over-reverse scheme is a good way to get Hessian-vector products efficiently and most AD tools are setup to easily provide this information. They're not entirely setup to provide the third derivative information in the form necessary for Halley's method. Though, they can absolutely do it. Finally, none of these methods are guaranteed to converge away from the optimal solution, so they need to be globalized with a trust-region or line-search. Given that we're now computing two directions for each step, the Newton and the Halley, there's some question as to if we need to globalize both or just one. I know there's an answer to this, but I don't know it off the top of my head. Mostly, it makes implementing the method a pain.
Anyway, in summary, the difficulty with higher-order methods is:
- Pain to deal linear systems
- Pain to get higher-order derivatives
- Pain to deal with globalization
The benefit, in my mind, is the ability to represent more complicated function phenomena than quadratic especially away from the optimal solution. Really, I do think there's benefit in these methods, but I don't know of anyone who has had the patience to really implement them.
Best Answer
In your quote, $L$ is a functional, which takes a function and returns a real number. It absolutely makes sense to talk about optimizing the function argument $f$ for given operators like $L$, and the field of Functional Analysis is dedicated to studying these sorts of optimization problems, among other things.