[Math] Choice of Lipschitz constant for proximal gradient optimization

fa.functional-analysis · oc.optimization-and-control · st.statistics

I'm trying to use proximal gradient methods (Forward-Backward Splitting and FISTA) to minimize a function

$f(B) = \frac{1}{2}\|XB - Y\|_F^2 + \frac{\gamma}{2}\|BC^T\|_F^2$, where $X \in \mathbb{R}^{N \times p}$, $Y \in \mathbb{R}^{N \times K}$, $B \in \mathbb{R}^{p \times K}$, $C \in \mathbb{R}^{E \times K}$.

I know this can be done with gradient descent, but I'd like to use proximal methods anyway for didactic purposes. Since the entire function is twice differentiable, the prox operator itself is not my concern at the moment.

Now, the basic proximal methods require computing the Lipschitz constant $L$ of the gradient $\nabla f$. I have seen different versions of this, e.g.

$L = \|X^T X\|_2^2 \, \|H_w \psi\|_2^2$

where $\|\cdot\|_2^2$ is the squared spectral norm of the matrix and $H_w\psi$ is the Hessian of the function $f(w) = \psi(Xw)$ (see http://www.di.ens.fr/~fbach/bach_jenatton_mairal_obozinski_FOT.pdf, p. 43). I've seen other formulations in other papers, which is confusing.
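
For concreteness, here is my own attempt at a constant for the specific $f$ above (I may well be missing something). The gradient is

$$\nabla f(B) = X^T(XB - Y) + \gamma\, B\, C^T C,$$

which is affine in $B$, so

$$\|\nabla f(B_1) - \nabla f(B_2)\|_F \le \left(\|X^T X\|_2 + \gamma \|C^T C\|_2\right) \|B_1 - B_2\|_F,$$

which would suggest $L = \|X^T X\|_2 + \gamma \|C^T C\|_2 = \|X\|_2^2 + \gamma \|C\|_2^2$ as a valid choice.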

Is there a "canonical" way to derive a "good" Lipschitz constant for these sort of functions?

Best Answer

In practice, you would not want to run a vanilla prox-gradient method that requires knowledge of the Lipschitz constant. Instead, you'd use a variant that combines the prox-gradient step with a line search (these notes give a nice, quick overview, along with pseudocode). More careful versions of FISTA-style and other prox-gradient methods exist that do not require knowledge of the Lipschitz constant. Shameless plug: a trust-region proximal splitting method that my co-authors and I once developed.
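
To make the line-search suggestion concrete, here is a minimal sketch of a prox-gradient loop with backtracking for the question's $f$. Assumptions: the nonsmooth part $g$ is zero, so the prox is the identity; the problem sizes, random data, and parameter names are made up for illustration.

```python
import numpy as np

# Hypothetical problem sizes and data, just for illustration.
N, p, K, E = 100, 20, 5, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(N, p))
Y = rng.normal(size=(N, K))
C = rng.normal(size=(E, K))
gamma = 0.1

def f(B):
    # Smooth objective: 0.5*||XB - Y||_F^2 + (gamma/2)*||B C^T||_F^2
    return (0.5 * np.linalg.norm(X @ B - Y, "fro") ** 2
            + 0.5 * gamma * np.linalg.norm(B @ C.T, "fro") ** 2)

def grad_f(B):
    # Gradient: X^T (XB - Y) + gamma * B C^T C
    return X.T @ (X @ B - Y) + gamma * B @ (C.T @ C)

def prox_g(B, t):
    # g = 0 here, so the prox is the identity; swap in a real
    # prox operator when a nonsmooth penalty is present.
    return B

def prox_gradient(B0, t0=1.0, shrink=0.5, max_iter=500, tol=1e-8):
    """Prox-gradient with backtracking: no Lipschitz constant needed."""
    B, t = B0, t0
    for _ in range(max_iter):
        g = grad_f(B)
        fB = f(B)
        while True:
            B_new = prox_g(B - t * g, t)
            diff = B_new - B
            # Accept the step once f(B_new) lies below the quadratic
            # upper bound at B (sufficient-decrease condition).
            if f(B_new) <= fB + np.sum(g * diff) + np.sum(diff ** 2) / (2 * t):
                break
            t *= shrink
        if np.linalg.norm(B_new - B, "fro") < tol:
            return B_new
        B = B_new
    return B

B_hat = prox_gradient(np.zeros((p, K)))
```

With a fixed step $t = 1/L$ this reduces to the vanilla method; the backtracking loop simply discovers a workable step size on the fly, which is why no Lipschitz constant is needed up front.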
