The advantage of using mirror descent over gradient descent is that it takes into account the geometry of the problem through the potential function $\Phi$.
We can view mirror descent as a generalization of projected gradient descent, which implicitly assumes a Euclidean geometry. Projected gradient descent is the special case of mirror descent with the potential $\Phi = \|\cdot\|^2/2$. But sometimes other choices of $\Phi$ can better model the geometry of our problem. Examples include $x\log x$ and $-\log x$, both of which are useful for problems over the positive orthant, the simplex, and similar sets.
To answer 1), it depends on the geometry of the problem. In high dimensions (large-scale optimization), it can be advantageous to abandon the Euclidean geometry to improve the constants in our convergence rates. As an example, consider a differentiable convex function $f$ on the Euclidean ball $B_{2,n}$ with $\|\nabla f(x)\|_{\infty}\leq 1$ for all $x\in B_{2,n}$. Then $\|\nabla f(x)\|_2\leq \sqrt{n}$ and projected gradient descent converges at a rate of $\sqrt{\frac{n}{k}}$. However, using mirror descent with an appropriately chosen $\Phi$ we can get a rate of $\sqrt{\frac{\log(n)}{k}}$. When $n$ is large, this is a huge improvement. The key point is that our choice of potential $\Phi$ determines the new geometry.
To answer 2), when working with probability distributions the Euclidean distance is often lackluster, and we prefer the so-called Kullback-Leibler divergence to measure the 'distance' between two probability distributions. In such scenarios, mirror descent with $\Phi$ set to the entropy provides better constants in our convergence rates.
To answer 3), we map our primal point $x$ to the dual space (through the mapping $\nabla \Phi$), take a step in the direction given by the gradient of our function, and then map back to the primal space (through $(\nabla\Phi)^{-1}$). For spaces which are reflexive (i.e. $X^{**} = X$) this is not a big deal; in a Hilbert space, for example, we do not worry so much, since the norm arises naturally from an inner product and the space is already isomorphic to its dual. However, in a more general context (an arbitrary Banach space), we have to be careful about handling this duality.
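To make the map-step-map loop concrete, here is a minimal numpy sketch (function and variable names are my own) of mirror descent on the probability simplex with the entropy potential, where the dual maps work out to exponentiated gradient updates:

```python
import numpy as np

def mirror_descent_entropy(grad, x0, steps, eta):
    """Mirror descent on the probability simplex with the entropy
    potential Phi(x) = sum_i x_i log x_i.

    grad : callable returning the gradient of f at x
    x0   : starting point on the simplex (positive, sums to 1)
    eta  : step size
    """
    x = x0.copy()
    for _ in range(steps):
        # Map to the dual space. Strictly, (nabla Phi)(x) = 1 + log(x),
        # but the constant 1 cancels under the normalization below.
        theta = np.log(x)
        # Gradient step in the dual space.
        theta -= eta * grad(x)
        # Map back via (nabla Phi)^{-1} = exp, then take the Bregman
        # (KL) projection onto the simplex, which is just normalization.
        x = np.exp(theta)
        x /= x.sum()
    return x
```

For instance, minimizing the linear function $f(x)=\langle c,x\rangle$ over the simplex drives $x$ toward the vertex with the smallest coordinate of $c$, since each step multiplies $x_i$ by $e^{-\eta c_i}$ before renormalizing.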
I know this is a vague answer that doesn't quite answer your questions, but I hope it helps you find the right way.
In your quote, $L$ is a functional, which takes a function and returns a real number. It absolutely makes sense to talk about optimizing the function argument $f$ for given operators like $L$, and the field of Functional Analysis is dedicated to studying these sorts of optimization problems, among other things.
Sure. There are a number of methods, but the most common is called Halley's method. I prefer the paper "On large-scale unconstrained optimization problems and higher order methods" by Gundersen and Steihaug.
That said, there are some good reasons to do this and some good reasons not to do this. The snarky answer as to why not is that Newton's method converges quadratically in a neighborhood of the solution. A third order method can converge cubically, but who cares if we can just do another quadratic iteration? That answer disregards that a second-order method does more than just converge quadratically. It takes care of affine scalings even away from the solution and that's valuable. A third-order method can represent even more complicated scalings and that may be valuable.
Really, the reason not to use one is pragmatic and deals with linear system solvers. Even though we call Newton's method a second-order method, it's really a first derivative trick. Really, what we're doing is writing the first order optimality conditions as a nonlinear system $F(x)=\nabla f(x) = 0$. Then, Newton's method is simply a first order Taylor series solution to this problem:
$F^\prime(x)\delta x + F(x) = 0$
becomes
$\nabla^2 f(x)\delta x + \nabla f(x) = 0$
Alright, so we could do an additional term in the Taylor series. Recall,
$$ F(x+\delta x) \approx F(x) + F^\prime(x)\delta x + \frac{1}{2}F^{\prime\prime}(x)(\delta x)(\delta x) $$
Here's where we run into trouble. $F^\prime(x)=\nabla^2 f(x)$ is a linear system and we know how to solve linear systems on a computer. We can factorize it. We can apply a Krylov method to it. We can do all sorts of things to it. That said, $F^{\prime\prime}(x)\in L(X,L(X))$. Technically it's a linear operator, but it returns another linear operator rather than a vector. As such, we don't have a straightforward way to set the Taylor series above to zero and solve. What Halley's method does is cheat. It first solves for the Newton step $\delta x_{\textrm{Newton}}=-F^{\prime}(x)^{-1}F(x)$ and then plugs this into one of the arguments of the last derivative. Hence, we have that
$$ F(x+\delta x) \approx F(x) + F^\prime(x)\delta x + \frac{1}{2}F^{\prime\prime}(x)(\delta x_{\textrm{Newton}})(\delta x) $$
Then, we can solve
$$ (F^\prime(x) + \frac{1}{2}F^{\prime\prime}(x)\delta x_{\textrm{Newton}})\delta x = -F(x) $$
or
$$ (\nabla^2 f(x) + \frac{1}{2}\nabla^3 f(x)\delta x_{\textrm{Newton}})\delta x = -\nabla f(x) $$
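In one dimension the two-solve structure above collapses to scalars, which makes it easy to see. Here is a minimal sketch (function names are my own; the derivatives are supplied by hand) of Halley's method for a scalar equation $F(x)=0$:

```python
def halley(F, dF, d2F, x, iters=10):
    """Halley's method for a scalar equation F(x) = 0.

    Each iteration first computes the Newton step, then solves the
    corrected system (F' + (1/2) F'' * dx_Newton) dx = -F.
    """
    for _ in range(iters):
        dx_newton = -F(x) / dF(x)
        dx = -F(x) / (dF(x) + 0.5 * d2F(x) * dx_newton)
        x += dx
    return x
```

As a usage example, take $f(x) = x^4/4 - 2x$, so the optimality condition is $F(x) = x^3 - 2 = 0$ with $F'(x) = 3x^2$ and $F''(x) = 6x$; a handful of iterations from $x_0 = 1.5$ lands on $2^{1/3}$ thanks to the cubic local convergence.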
Alright, a pain, but somewhat workable. Of course, now we have other problems. Someone has to sit down and derive the third derivative, which can be prohibitively difficult. This can be alleviated by using automatic differentiation (AD). However, if we want to use AD, we have to be smart about things. A forward-over-reverse scheme is a good way to get Hessian-vector products efficiently, and most AD tools are set up to easily provide this information. They're not generally set up to provide third-derivative information in the form necessary for Halley's method, though they can absolutely do it. Finally, none of these methods is guaranteed to converge away from the optimal solution, so they need to be globalized with a trust region or line search. Given that we're now computing two directions for each step, the Newton and the Halley, there's some question as to whether we need to globalize both or just one. I know there's an answer to this, but I don't know it off the top of my head. Mostly, it makes implementing the method a pain.
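To illustrate the AD point, here is a sketch using JAX (function names are my own) of the forward-over-reverse trick: a Hessian-vector product from a JVP over the gradient, and one more forward pass to get the contracted third-derivative term $\nabla^3 f(x)\,u\,v$ that Halley's method needs:

```python
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    """Hessian-vector product nabla^2 f(x) v via forward-over-reverse:
    differentiate grad(f) at x in the direction v with a JVP."""
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

def third_order_term(f, x, u, v):
    """(nabla^3 f(x)) u v: one more forward-mode pass over the HVP,
    which is the contraction Halley's method plugs dx_Newton into."""
    return jax.jvp(lambda y: hvp(f, y, u), (x,), (v,))[1]
```

For example, with $f(x)=\sum_i x_i^4$ the Hessian is $\mathrm{diag}(12x_i^2)$ and the third-derivative tensor contracts to $24\,x_i u_i v_i$ componentwise, which these two functions reproduce without any hand-derived formulas.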
Anyway, in summary, the difficulties with higher-order methods are:

- $F^{\prime\prime}(x)$ is not a plain linear system, so we have to resort to tricks like Halley's to make the truncated Taylor series solvable;
- someone has to derive the third derivative, and AD tools aren't really set up to provide it in the form needed;
- globalizing a method that computes two directions per step is awkward to implement.
The benefit, in my mind, is the ability to represent more complicated function phenomena than quadratic especially away from the optimal solution. Really, I do think there's benefit in these methods, but I don't know of anyone who has had the patience to really implement them.