[Math] Smoothing L1 norm, Huber vs Conjugate

fa.functional-analysis na.numerical-analysis

I'm trying to minimize a convex (not necessarily strictly convex) function involving an L1 norm (similar to lasso), which makes it non-differentiable at some points. So I'd like to smooth it and treat it as an L2 norm problem.

The two approaches I've seen (http://www.ee.ucla.edu/~vandenbe/236C/lectures/smoothing.pdf) are directly smoothing the L1 norm using the Huber function, and smoothing the conjugate (i.e., derive the dual norm, which here is L-infinity and is still non-differentiable, then smooth that).

The Huber approach is much simpler. Is there any advantage to the conjugate method over Huber? I can't see the point of smoothing the dual instead of just smoothing the primal.

Best Answer

Following the suggestion of András Bátkai I post my comment as an answer:

Smoothing the dual and smoothing the primal problem are quite different things: smoothing the dual will not give you a smooth primal. However, you get a strongly convex primal by dual smoothing (as opposed to merely a strictly convex primal by Huber smoothing). Hence, it depends on what kind of regularity you are aiming at: a smoother primal or a "more convex" primal. Both can be helpful algorithmically. A smooth primal allows you to use gradients instead of subgradients, and in turn allows you to apply gradient methods with appropriate stopping rules and such. A strongly convex primal leads to a proximal mapping of the primal objective which is not only non-expansive but contractive, which is favorable for proximal-splitting methods.

Of course, you can also apply both primal and dual smoothing if you like.
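To make the two kinds of regularity concrete, here is a minimal one-dimensional sketch in Python. It is only an illustration under my own assumptions: the names `mu`, `sigma` and `t` are hypothetical parameters, and the "strongly convex primal" is modeled by adding `(sigma/2) x^2` to `|x|`, which corresponds to replacing the conjugate (the indicator of `[-1, 1]`) by its Moreau envelope. The first part shows that the Huber function has an ordinary gradient; the second shows that the prox of the strongly convex version shrinks distances by the factor `1/(1 + t*sigma)`, while the plain soft-threshold is only non-expansive.

```python
import math


def huber(x, mu):
    """Huber smoothing of |x|: quadratic on [-mu, mu], linear outside."""
    if abs(x) <= mu:
        return x * x / (2.0 * mu)
    return abs(x) - mu / 2.0


def huber_grad(x, mu):
    """Derivative of the Huber function; defined everywhere, (1/mu)-Lipschitz."""
    return max(-1.0, min(1.0, x / mu))


def soft_threshold(x, t):
    """prox of t*|.|: shrink x towards zero by t."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def prox_abs(x, t):
    """prox of t*|u|; non-expansive, but not contractive."""
    return soft_threshold(x, t)


def prox_abs_plus_quadratic(x, t, sigma):
    """prox of t*(|u| + (sigma/2) u^2); contractive with factor 1/(1 + t*sigma)."""
    return soft_threshold(x, t) / (1.0 + t * sigma)


if __name__ == "__main__":
    mu, t, sigma = 1.0, 1.0, 1.0  # illustrative values only
    print("Huber gradient at 0:", huber_grad(0.0, mu))
    x, y = 5.0, 3.0
    print("input distance:", abs(x - y))
    print("after prox of |.|:", abs(prox_abs(x, t) - prox_abs(y, t)))
    print("after prox of |.| + quadratic:",
          abs(prox_abs_plus_quadratic(x, t, sigma)
              - prox_abs_plus_quadratic(y, t, sigma)))
```

For the two points above the plain prox keeps the distance unchanged, whereas the strongly convex version halves it, which is the contraction property referred to in the answer.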

Moreover, note that there are numerous methods to treat nonsmooth convex minimization problems efficiently.

Related Solutions

[Math] The mathematical theory of Feynman integrals

Most of this is standard theory of path integrals known to mathematical physicists so I will try to address all of your questions.

First let me say that the hypotheses you list for the action $S$ to make the path integral well defined, i.e. that $S=Q+V$ where $Q$ is quadratic and non-degenerate and $V$ is bounded, are extremely restrictive. One should think of $V$ as defining the potential energy for interactions of the physical system, and while it is certainly true that one expects this to be bounded below, there are very few physical systems where it is also bounded above (this is also true for interesting mathematical applications). Essentially, requiring that the potential be bounded implies that the asymptotic behavior of $S$ in the configuration space is totally controlled by the quadratic piece. Since path integrals with quadratic actions are trivial to define and evaluate, it is not really that surprising or interesting that by bounding the potential one can make the integral well behaved.

Next you ask if anyone has studied the question of when an action $S$ gives rise to a well defined path integral: $ \int \mathcal{D}f \ e^{-S[f(x)]}$

The answer of course is yes. The people who come to mind first are Glimm and Jaffe, who have made whole careers studying this issue. In all cases of interest $S$ is an integral $S=\int L$, where the integral is over your spacetime manifold $M$ (in the simplest case $\mathbb{R}^{n}$), and the problem is to constrain $L$. The problem remains unsolved, but nevertheless there are some existence proofs. The basic example is a scalar field theory, i.e. we are trying to integrate over a space of maps $\phi: M \rightarrow \mathbb{R}$. We take an $L$ of the form:

$ L = -\phi\Delta \phi +P(\phi)$

Here $\Delta$ is the Laplacian and $P$ is a polynomial. The main nontrivial result is then that if $M$ is three dimensional, and $P$ is bounded below with degree less than seven, then the functional integral exists rigorously. Extending this analysis to the case where $M$ has dimension four is a major unsolved problem.

Moving on to your next point, you ask about another approach to path integrals called perturbation theory. The typical example here is when the action is of the form $S= Q+\lambda V$ where $Q$ is quadratic, $V$ is not, and $\lambda$ is a parameter. We attempt a series expansion in $\lambda$. The first thing to say here, and this is very important, is that in doing this expansion I am not attempting to define the functional integral by its series expansion, rather I am attempting to approximate it by a series. Let me give an example of the difference. Consider the following function $f(\lambda)$:

$f(\lambda)=\int_{-\infty}^{\infty}dx \ e^{-x^{2}-\lambda x^{4}}$

The function $f$ is manifestly non-analytic in $\lambda$ at $\lambda=0$. Indeed if $\lambda<0$ the integral diverges, while if $\lambda \geq 0$ the integral converges. Nevertheless we can still be rash and attempt to define a series expansion of $f$ in powers of $\lambda$ by expanding the exponential and then interchanging the order of summation and integration (illegal to be sure!). We arrive at a formal series:

$s(\lambda)=\sum_{n=0}^{\infty}\frac{\lambda^{n}}{n!}\int_{-\infty}^{\infty}dx \ e^{-x^{2}}(-x^{4})^{n}$

Of course this series diverges. However, this expansion was not in vain. $s(\lambda)$ is a basic example of an asymptotic series. For small $\lambda$, truncating the series at a finite order of roughly $\frac{1}{4\lambda}$ or below (the terms shrink up to about that order and grow afterwards) gives an excellent approximation to the function $f(\lambda)$.
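The asymptotic behaviour is easy to check numerically. The following Python sketch (standard library only) compares a brute-force trapezoid-rule value of $f(\lambda)$ with the partial sums of $s(\lambda)$, using the closed form $\int_{-\infty}^{\infty}dx \ e^{-x^{2}}x^{4n}=\Gamma(2n+\tfrac{1}{2})$ for the terms. The value $\lambda=0.02$, the integration window and the step count are my own illustrative choices, not part of the argument above. The ratio of consecutive terms is roughly $4n\lambda$, so the error should first shrink and then blow up.

```python
import math


def f_numeric(lam, half_width=8.0, n_steps=8000):
    """Trapezoid-rule value of the integral of exp(-x^2 - lam*x^4) over the real line.

    The integrand is negligible outside [-half_width, half_width] for lam >= 0.
    """
    h = 2.0 * half_width / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        x = -half_width + i * h
        weight = 0.5 if i in (0, n_steps) else 1.0
        total += weight * math.exp(-x * x - lam * x ** 4)
    return total * h


def series_term(lam, n):
    """n-th term of the formal series: (-lam)^n / n! * Gamma(2n + 1/2)."""
    return (-lam) ** n * math.gamma(2 * n + 0.5) / math.factorial(n)


def truncation_errors(lam, max_order):
    """|f(lam) - partial sum up to order N| for N = 0, ..., max_order."""
    exact = f_numeric(lam)
    errors = []
    partial = 0.0
    for n in range(max_order + 1):
        partial += series_term(lam, n)
        errors.append(abs(exact - partial))
    return errors


if __name__ == "__main__":
    lam = 0.02  # illustrative coupling
    errors = truncation_errors(lam, 40)
    for order in (0, 4, 8, 12, 20, 30, 40):
        print(f"order {order:2d}: error {errors[order]:.3e}")
```

With this choice of $\lambda$ the smallest error occurs near order 12, after which adding more terms makes the approximation worse, exactly the signature of an asymptotic rather than convergent series.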

Returning to the example of Feynman integrals, the first point is that the perturbation expansion in $\lambda$ is an asymptotic series not a Taylor series. Thus just as for $s(\lambda)$ it is misguided to ask if the series converges...we already know that it does not! A better question is to ask for which actions $S$ this approximation scheme of perturbation theory itself exists. On this issue there is a complete and rigorous answer worked out by mathematical physicists in the late 70s and 80s called renormalization theory. A good reference is the book by Collins "Renormalization." Connes and Kreimer have not added new results here; rather they have given modern proofs of these results using Hopf algebras etc.

Finally I will hopefully answer some of your questions about Chern-Simons theory. The basic point is that Chern-Simons theory is a topological field theory. This means that it suffers from none of the difficulties of usual path integrals. In particular all quantities we want to compute can be reduced to finite dimensional integrals which are of course well defined. Of course since we lack an independent definition of the Feynman integral over the space of connections, the argument demonstrating that it reduces to a finite dimensional integral is purely formal. However we can simply take the finite dimensional integrals as the definition of the theory. A good expository account of this work can be found in the recent paper of Beasley "Localization for Wilson Loops in Chern-Simons Theory."

Overall I would say that by far the most developed current approach to studying path integrals rigorously is that of discretization. One approximates spacetime by a lattice of points and the path integral by a regular integral at each lattice site. The hard step is to prove that the limit as the lattice spacing $a$ goes to zero, the so-called continuum limit, exists. This is a very hard analysis problem. Glimm and Jaffe succeeded in using this method to construct the examples I mentioned above, but their arguments appear limited. Schematically, when we take the limit of zero lattice spacing we also need to take a limit of our action; in other words, the action should be a function of $a$. We now write $S(a)=Q+\lambda V+H(a,\lambda)$, where as usual $Q$ is quadratic, $V$ is not, and $\lambda$ is a parameter. Our original action is $S=Q +\lambda V$.

The question is then: can we find an $H(a,\lambda)$ such that a suitable $a\rightarrow 0$ limit exists? A priori one could try any $H$; however, the arguments of Glimm and Jaffe are limited to the case where $H$ is polynomial in $\lambda$. Physically this means that the theory is very insensitive to short distance effects; in other words, one could modify the interactions slightly at short distances and one would find essentially the same long distance physics. It seems that new methods are needed to generalize to a larger class of continuum limits.
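As a toy illustration of what a continuum limit looks like, here is a Python sketch for the one case where everything is explicit: a purely quadratic action in one dimension (so $\lambda=0$ and no $H$ is needed). I assume a periodic lattice of total length `length` with `n_sites` points, spacing $a$, and the action $S=\sum_i a\left[\frac{(\phi_{i+1}-\phi_i)^2}{2a^2}+\frac{m^2\phi_i^2}{2}\right]$; these choices and all function names are mine, not taken from Glimm and Jaffe. The Gaussian integral gives $\langle\phi_i^2\rangle$ as a finite sum over lattice momenta, and as $a\to 0$ it should approach the continuum value $\frac{1}{2m}\coth\left(\frac{mT}{2}\right)$ for length $T$.

```python
import math


def lattice_variance(mass, length, n_sites):
    """<phi_i^2> for the periodic quadratic lattice action described above.

    The action is (1/2) phi^T A phi, and A is diagonal in Fourier space with
    eigenvalues a * (mass^2 + (2/a * sin(pi*k/n_sites))^2).
    """
    a = length / n_sites
    total = 0.0
    for k in range(n_sites):
        lattice_momentum_sq = (2.0 / a * math.sin(math.pi * k / n_sites)) ** 2
        total += 1.0 / (a * (mass ** 2 + lattice_momentum_sq))
    return total / n_sites


def continuum_variance(mass, length):
    """Continuum value (1 / 2m) * coth(m * length / 2) on a circle of given length."""
    return 1.0 / (2.0 * mass * math.tanh(mass * length / 2.0))


if __name__ == "__main__":
    mass, length = 1.0, 10.0  # illustrative values
    target = continuum_variance(mass, length)
    for n_sites in (10, 100, 1000):
        value = lattice_variance(mass, length, n_sites)
        print(f"a = {length / n_sites:6.3f}  lattice = {value:.8f}  "
              f"error = {abs(value - target):.2e}")
```

For a quadratic action the limit exists without any adjustment of the action. The difficulty described above starts once the interaction $\lambda V$ is switched on and the lattice quantities no longer converge unless the action itself is tuned with $a$.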

[Math] Choice of Lipschitz constant for proximal gradient optimization

In practice, you would not want to run a vanilla prox-gradient method that requires knowledge of the Lipschitz constant. Instead, you'd use a method that incorporates a line search (these notes give a nice, quick overview, along with pseudocode). More careful versions of FISTA-style and other prox-gradient methods exist that do not require knowledge of the Lipschitz constant. Shameless plug: a trust-region proximal splitting method that my co-authors and I once developed.
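For concreteness, here is a minimal Python sketch of a prox-gradient iteration with a standard backtracking line search, applied to a lasso-type objective $\tfrac12\|Ax-b\|^2+\lambda\|x\|_1$. It is not the pseudocode from the notes mentioned above, nor the trust-region method. The function names, the initial guess `L`, the growth factor `eta` and the small tolerance in the acceptance test are all assumptions of mine. The estimate `L` is simply increased until the usual quadratic upper bound holds at the trial point, so no Lipschitz constant has to be supplied.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def smooth_part(A, b, x):
    """Value and gradient of g(x) = 0.5 * ||A x - b||^2."""
    r = [ax_i - b_i for ax_i, b_i in zip(matvec(A, x), b)]
    value = 0.5 * sum(r_i * r_i for r_i in r)
    grad = [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]
    return value, grad


def soft_threshold(v, t):
    """prox of t * ||.||_1, applied componentwise."""
    return [max(abs(v_j) - t, 0.0) * (1.0 if v_j > 0 else -1.0) for v_j in v]


def prox_gradient_lasso(A, b, lam, n_iter=200, L=0.1, eta=2.0):
    """Prox-gradient with backtracking; returns the iterate and the final L."""
    x = [0.0] * len(A[0])
    for _ in range(n_iter):
        g_x, grad = smooth_part(A, b, x)
        while True:
            step = [x_j - g_j / L for x_j, g_j in zip(x, grad)]
            x_new = soft_threshold(step, lam / L)
            d = [n_j - o_j for n_j, o_j in zip(x_new, x)]
            # Quadratic upper model of g around x with curvature L.
            bound = (g_x + sum(g_j * d_j for g_j, d_j in zip(grad, d))
                     + 0.5 * L * sum(d_j * d_j for d_j in d))
            g_new, _ = smooth_part(A, b, x_new)
            if g_new <= bound + 1e-12:  # small slack against rounding
                break
            L *= eta  # estimate too small: increase and retry
        x = x_new
    return x, L


if __name__ == "__main__":
    A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    b = [3.0, -0.5, 1.0]
    x, L = prox_gradient_lasso(A, b, lam=1.0)
    print("solution:", x, "final L estimate:", L)
```

With the identity matrix the minimizer is just the soft-threshold of `b`, which makes the sketch easy to sanity-check. This variant never decreases `L`; schemes that also try smaller values between iterations are a common refinement.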
