Sure. There are a number of methods, but the most common is called Halley's method. I prefer the paper "On large-scale unconstrained optimization problems and higher order methods" by Gundersen and Steihaug.
That said, there are some good reasons to do this and some good reasons not to do this. The snarky answer as to why not is that Newton's method converges quadratically in a neighborhood of the solution. A third order method can converge cubically, but who cares if we can just do another quadratic iteration? That answer disregards that a second-order method does more than just converge quadratically. It takes care of affine scalings even away from the solution and that's valuable. A third-order method can represent even more complicated scalings and that may be valuable.
Really, the reason not to use one is pragmatic and has to do with linear system solvers. Even though we call Newton's method a second-order method, it's really a first-derivative trick: what we're doing is writing the first-order optimality conditions as a nonlinear system $F(x)=\nabla f(x) = 0$. Then, Newton's method is simply a first-order Taylor series solution to this problem:
$$
F^\prime(x)\delta x + F(x) = 0
$$
becomes
$$
\nabla^2 f(x)\delta x + \nabla f(x) = 0
$$
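Here's a minimal sketch of that iteration in Python (NumPy and the toy objective are my own choices; nothing here is prescribed by the original problem):

```python
import numpy as np

# One Newton step viewed as a root-find on F(x) = grad f(x):
# solve  hess f(x) dx = -grad f(x),  then update x + dx.
# Toy objective (my choice): f(x) = sum(exp(x_i)) + 0.5 * ||x||^2.
grad = lambda x: np.exp(x) + x
hess = lambda x: np.diag(np.exp(x)) + np.eye(x.size)

x = np.array([1.0, -2.0, 0.3])
for _ in range(8):
    x += np.linalg.solve(hess(x), -grad(x))
print(x, grad(x))  # gradient should now be essentially zero
```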
Alright, so we could do an additional term in the Taylor series. Recall,
$$
F(x+\delta x) \approx F(x) + F^\prime(x)\delta x + \frac{1}{2}F^{\prime\prime}(x)(\delta x)(\delta x)
$$
Here's where we run into trouble. $F^\prime(x)=\nabla^2 f(x)$ defines a linear system and we know how to solve linear systems on a computer. We can factorize it. We can apply a Krylov method to it. We can do all sorts of things to it. That said, $F^{\prime\prime}(x)\in L(X,L(X))$. Technically, it's a linear operator, but it returns another linear operator when applied to a direction. As such, we don't have a straightforward answer on how to set the Taylor series above to zero and solve. What Halley's method does is cheat. It first solves for the Newton step $\delta x_{\textrm{Newton}}=-F^{\prime}(x)^{-1}F(x)$ and then plugs this into one of the argument slots of the second derivative. Hence, we have that
$$
F(x+\delta x) \approx F(x) + F^\prime(x)\delta x + \frac{1}{2}F^{\prime\prime}(x)(\delta x_{\textrm{Newton}})(\delta x)
$$
Then, we can solve
$$
(F^\prime(x) + \frac{1}{2}F^{\prime\prime}(x)\delta x_{\textrm{Newton}})\delta x = -F(x)
$$
or
$$
(\nabla^2 f(x) + \frac{1}{2}\nabla^3 f(x)\delta x_{\textrm{Newton}})\delta x = -\nabla f(x)
$$
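To make that recipe concrete, here's a rough sketch of the Halley iteration in Python with NumPy; the toy objective $f(x,y)=x^4+y^4+xy$ and its hand-derived derivatives are my own choices for illustration.

```python
import numpy as np

def grad(x):                      # F(x) = grad f(x)
    return np.array([4 * x[0]**3 + x[1], 4 * x[1]**3 + x[0]])

def hess(x):                      # F'(x) = hess f(x)
    return np.array([[12 * x[0]**2, 1.0], [1.0, 12 * x[1]**2]])

def third(x, u):                  # F''(x) u: d/dt hess(x + t u) at t = 0, as a matrix
    return np.array([[24 * x[0] * u[0], 0.0], [0.0, 24 * x[1] * u[1]]])

def halley_step(x):
    g, H = grad(x), hess(x)
    dx_newton = np.linalg.solve(H, -g)        # F'(x) dx_N = -F(x)
    A = H + 0.5 * third(x, dx_newton)         # F'(x) + 1/2 F''(x) dx_N
    dx = np.linalg.solve(A, -g)               # the Halley step
    return x + dx

x = np.array([1.0, -0.5])
for _ in range(8):
    x = halley_step(x)
print(x, grad(x))  # gradient should be essentially zero at convergence
```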
Alright, a pain, but somewhat workable. Of course, now we have other problems. Someone has to sit down and derive the third derivative, which can be prohibitively difficult. This can be alleviated by using automatic differentiation (AD). However, if we want to use AD, we have to be smart about things. Using a forward-over-reverse scheme is a good way to get Hessian-vector products efficiently and most AD tools are set up to easily provide this information. They're not entirely set up to provide the third-derivative information in the form necessary for Halley's method, though they can absolutely do it. Finally, none of these methods are guaranteed to converge away from the optimal solution, so they need to be globalized with a trust region or line search. Given that we're now computing two directions for each step, the Newton and the Halley, there's some question as to whether we need to globalize both or just one. I know there's an answer to this, but I don't know it off the top of my head. Mostly, it makes implementing the method a pain.
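For what it's worth, here's a sketch of the forward-over-reverse idea using JAX (my choice of AD tool, not one named above); `hvp` is the standard Hessian-vector product, and one more forward sweep gives the contracted third derivative that Halley's method needs:

```python
import jax
import jax.numpy as jnp

def f(x):
    # toy smooth objective (my choice), standing in for the real cost function
    return jnp.sum(x**4) + x[0] * x[1]

grad_f = jax.grad(f)

def hvp(x, v):
    # forward-over-reverse: Hessian-vector product  nabla^2 f(x) v
    return jax.jvp(grad_f, (x,), (v,))[1]

def third_contract(x, u, v):
    # one more forward sweep: nabla^3 f(x)[u, v], the term Halley's method needs
    return jax.jvp(lambda y: hvp(y, v), (x,), (u,))[1]

x = jnp.array([1.0, -0.5])
u = jnp.array([0.1, 0.2])
v = jnp.array([-0.3, 0.4])
print(hvp(x, v), third_contract(x, u, v))
```

To apply the Halley system matrix-free, you would take `u = dx_newton` and hand the map `v -> hvp(x, v) + 0.5 * third_contract(x, dx_newton, v)` to a Krylov solver rather than forming the matrix explicitly.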
Anyway, in summary, the difficulties with higher-order methods are:
- Pain to deal with the linear systems
- Pain to get higher-order derivatives
- Pain to deal with globalization
The benefit, in my mind, is the ability to represent more complicated function behavior than a quadratic, especially away from the optimal solution. Really, I do think there's benefit in these methods, but I don't know of anyone who has had the patience to really implement them.
Best Answer
1)
First-order methods have many variants, for example using conjugate directions instead of the steepest-descent direction (the conjugate gradient method).
There is also a multitude of line-search algorithms for computing the step length in first-order methods. These include bracketing methods (golden-section search) as well as Newton and quasi-Newton methods: a second-order method is used to compute the step length inside a first-order method. It seems weird, but the point is that the first-order method works in many dimensions, while the second-order line search only ever sees a univariate objective function (the objective restricted to the line along the step direction).
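A rough sketch of that pattern, steepest descent with a golden-section line search along the step direction, might look like this in Python (the quadratic test objective and the bracket `max_step` are placeholders of mine):

```python
import numpy as np

def golden_section(phi, a, b, tol=1e-6):
    """Minimize a univariate function phi on [a, b] by golden-section search."""
    invphi = (np.sqrt(5.0) - 1.0) / 2.0
    while b - a > tol:
        c, d = b - invphi * (b - a), a + invphi * (b - a)
        if phi(c) < phi(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

def steepest_descent(f, grad, x, max_step=1.0, iters=100):
    """First-order method: the multidimensional direction is -grad(x);
    the step length comes from a 1-D search along that line."""
    for _ in range(iters):
        d = -grad(x)
        t = golden_section(lambda t: f(x + t * d), 0.0, max_step)
        x = x + t * d
    return x

# Placeholder quadratic objective, just to exercise the code.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
print(steepest_descent(f, grad, np.array([3.0, 1.0])))  # should approach the origin
```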
In theory, you can also use numerical derivatives, i.e. compute them from cost-function evaluations. This allows you to use first-order methods without an analytical representation of the derivatives. Methods for computing derivatives include finite-difference approximation, Neville's method, Ridder's method, and automatic differentiation.
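For instance, a bare-bones central-difference gradient looks like this (the step size `h` is a placeholder; choosing it well is its own topic):

```python
import numpy as np

def fd_gradient(f, x, h=1e-6):
    """Central finite-difference approximation of grad f(x);
    only cost-function evaluations are needed, no analytic derivatives."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

# Quick check against a known gradient.
f = lambda x: x[0]**4 + x[1]**4 + x[0] * x[1]
print(fd_gradient(f, np.array([1.0, -0.5])))  # roughly [3.5, 0.5]
```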
Second-order methods like Gauss-Newton and Levenberg-Marquardt (for least-squares problems) do not use the explicit Hessian; they approximate it from the Jacobian $J$ of the residuals:
$$H\approx J^{\mathsf{T}}J$$
The reason behind this approximation (which does not require explicit second-order derivatives) would be outside the scope of this answer, but using such a Hessian tends to be more numerically stable, because the second-order terms are noise-sensitive.
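As a sketch of how the $J^{\mathsf{T}}J$ approximation gets used, here is one damped Gauss-Newton (Levenberg-Marquardt-style) step; the exponential-fit residuals and the fixed damping `lam` are stand-ins of my own, and a real implementation would adapt the damping.

```python
import numpy as np

def lm_step(r, J, x, lam=1e-3):
    """One Levenberg-Marquardt-style step for f(x) = 0.5 * ||r(x)||^2.
    The Hessian is approximated by J^T J (no second derivatives),
    damped by lam * I; lam -> 0 recovers the Gauss-Newton step."""
    Jx, rx = J(x), r(x)
    A = Jx.T @ Jx + lam * np.eye(x.size)
    dx = np.linalg.solve(A, -Jx.T @ rx)
    return x + dx

# Toy residuals: fit y ~ a * exp(b * t) at a few sample points (my own example).
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 2.7, 3.6, 4.9])
r = lambda p: p[0] * np.exp(p[1] * t) - y
J = lambda p: np.column_stack([np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)])

p = np.array([1.5, 0.2])
for _ in range(20):
    p = lm_step(r, J, p)
print(p, 0.5 * np.sum(r(p)**2))  # parameters near (2, 0.3), small residual
```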
2)
There are many methods for derivative-free optimization, some designed for large and sparse datasets like the black-box cost function you have. Such methods include model-based methods, coordinate- and pattern-search methods, the conjugate-direction method, the Nelder-Mead method, and implicit filtering.
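For example, assuming SciPy is acceptable (my assumption; no library is named above), a Nelder-Mead run on a stand-in black-box cost looks like:

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in for a black-box cost: only function values, no derivatives.
def black_box_cost(x):
    return np.sum((x - np.array([1.0, -2.0, 0.5]))**2) + np.sin(x[0])

result = minimize(black_box_cost, x0=np.zeros(3), method='Nelder-Mead',
                  options={'maxiter': 2000, 'xatol': 1e-8, 'fatol': 1e-8})
print(result.x, result.fun)
```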
Maybe a proper design of your cost function will allow you to omit unknown data without harming the result too much.
The following books helped me greatly and both go into detail when it comes to high-fidelity optimization (large and sparse datasets, unknown derivatives):
Jorge Nocedal, Stephen J. Wright: "Numerical Optimization, Second Edition"
Press, Teukolsky, Vetterling, Flannery: "Numerical Recipes, Third Edition"