Explicit relation between the regularization parameters in Ivanov and Tikhonov regularization

convex optimization, machine learning, nonlinear optimization, optimization, regularization

Consider the class of functions $\mathcal H_\phi $ defined by
$$\mathcal H_\phi :=\left\{h:\mathbb R^d \to \mathbb R\mid h(x) = \langle w,\phi(x)\rangle+b,\ (w,b)\in\mathbb R^D\times\mathbb R\right\} $$
where $\phi : \mathbb R^d\to\mathbb R^D$ is a feature map. Assume that we are given an empirical loss function $\hat L$ and that we want to solve the following problem:
$$h \in \arg\min_{\mathcal H_\phi} \hat L(h) \tag 1$$
Solving $(1)$ can be thought of as solving a soft-margin SVM problem.

To solve $(1)$ while avoiding overfitting, a common approach is to add a penalty term on the $\ell_2$ norm of the weight vector $w$, which is called $L_2$ or Tikhonov regularization:
$$h_\lambda \in \arg\min_{\mathcal H_\phi} \hat L(h) + \frac\lambda 2\|w\|^2\tag 2 $$
Alternatively, another approach is to directly constrain the weight vector to remain within a ball of a certain radius. This is called Ivanov regularization:
$$h_R \in \arg\min_{\mathcal H_\phi} \hat L(h),\ \text{ s.t. } \|w\|^2\le R^2\tag 3$$
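For concreteness, here is a minimal numerical sketch of the two formulations (my own illustration, not from the paper), assuming a squared empirical loss and synthetic data and dropping the bias $b$: the Tikhonov problem $(2)$ then has a closed-form solution, and the Ivanov problem $(3)$ can be solved by projected gradient descent onto the ball of radius $R$. Choosing $R = \|w_\lambda\|$ recovers the Tikhonov solution, which is exactly the equivalence discussed below.

```python
import numpy as np

# Synthetic data and a squared empirical loss (an assumption for illustration;
# the question does not fix a particular loss). The bias b is omitted.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def loss_grad(w):
    # Gradient of L_hat(w) = ||X w - y||^2 / (2 n)
    return X.T @ (X @ w - y) / n

# Tikhonov (2): closed form for the squared loss.
lam = 0.5
w_tik = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Ivanov (3): projected gradient descent onto the ball {||w|| <= R}.
def ivanov(R, steps=20_000, lr=0.1):
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * loss_grad(w)
        norm = np.linalg.norm(w)
        if norm > R:
            w *= R / norm          # Euclidean projection onto the ball
    return w

R = np.linalg.norm(w_tik)          # the radius suggested by the theorem
w_iva = ivanov(R)
print(np.linalg.norm(w_tik - w_iva))   # approximately 0: the two solutions coincide
```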
In their paper "Tikhonov, Ivanov and Morozov regularization for support vector machine learning" (2016), Oneto et al. proved, under the assumption that $\hat L$ is convex with respect to $w$, that a Tikhonov-regularized SVM is equivalent to an Ivanov-regularized SVM, in the following sense:

Theorem: For any $\lambda$, there exists an $R \equiv R(\lambda)\in\mathbb R$ such that problems $(2)$ and $(3)$ have the same solutions (i.e. such that $h_\lambda = h_R$).

Regarding this theorem, I mainly have two questions:

  • How does $R(\lambda)$ vary with respect to $\lambda$? Is it at least continuous? Is it possible to get any lower bound on it in terms of $\lambda$? Intuitively, one would expect that $R(\lambda)\to\infty$ continuously as $\lambda \to 0$, but my attempts to prove this have failed, and the proof given in the paper is not constructive, so it does not help.
  • Does this result apply to more general families of functions? In particular, I am interested in deep neural networks. In that case, the convexity assumption on $\hat L$ no longer holds, and it seems unlikely that the authors' approach carries over. But, intuitively again, one would expect that adding a penalty on the Euclidean norm of the weight parameters corresponds to restricting the parameters to a ball whose radius depends on the strength of the penalty.

I will be grateful for any help or useful references you may provide.

Best Answer

Assume that for $\lambda>0$ problem (2) is solvable with solution $w_\lambda$. If $\hat L$ is convex, then the objective of (2) is strongly convex in $w$, so a solution exists and is unique.

Now take $\lambda<\lambda'$, and let $w$ and $w'$ be the associated solutions. Then by optimality $$\begin{split} \hat L(w) + \frac\lambda2\|w\|^2& \le \hat L(w') + \frac\lambda2\|w'\|^2\\ &= \hat L(w')+ \frac{\lambda'}2\|w'\|^2 + \frac{\lambda-\lambda'}2\|w'\|^2\\ &\le \hat L(w) + \frac{\lambda'}2\|w\|^2 + \frac{\lambda-\lambda'}2\|w'\|^2.\\ \end{split} $$ Rearranging shows $$ 0 \le (\lambda - \lambda')( \|w'\|^2 - \|w\|^2), $$ and since $\lambda - \lambda' < 0$ this gives $\|w'\| \le \|w\|$: the map $\lambda \mapsto \|w_\lambda\|$ is monotonically decreasing, as expected. If the problem has a solution for $\lambda=0$, then we can use the corresponding solution $w_0$ to get $\|w_\lambda\| \le \|w_0\|$ for all $\lambda$.
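As a quick numerical sanity check of this monotonicity (again a sketch under the squared-loss assumption from the earlier sketch, not part of the argument), one can evaluate $\|w_\lambda\|$ on a grid of $\lambda$ values via the ridge closed form and verify that it decreases and stays below $\|w_0\|$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def w_lambda(lam):
    # Tikhonov solution for the squared loss; lam = 0 gives the unregularized solution w_0.
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

lams = np.linspace(0.0, 5.0, 200)
norms = [np.linalg.norm(w_lambda(lam)) for lam in lams]

# lambda -> ||w_lambda|| is decreasing and bounded by ||w_0||.
assert all(a >= b for a, b in zip(norms, norms[1:]))
assert max(norms) <= np.linalg.norm(w_lambda(0.0)) + 1e-12
```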

Now set $R(\lambda):=\|w_\lambda\|$ and take any $w$ with $\|w\|\le R(\lambda)$. Then by optimality of $w_\lambda$ we have $$ \hat L(w_\lambda) + \frac\lambda 2\|w_\lambda\|^2 \le \hat L(w) + \frac\lambda 2\|w\|^2 \le \hat L(w) + \frac\lambda 2\|w_\lambda\|^2, $$ hence $\hat L(w_\lambda) \le \hat L(w)$, that is, $w_\lambda$ solves (3) with $R = R(\lambda)$.

If $\hat L$ is convex, then $\lambda \mapsto \|w_\lambda\|$ is continuous. Let me prove this for differentiable $\hat L$; the same proof works with subgradients as well. Again let $0<\lambda<\lambda'$, and let $w$ and $w'$ be the associated solutions. Then $$ \nabla \hat L(w) + \lambda w = 0 = \nabla \hat L(w') + \lambda'w'. $$ Rearranging and taking the inner product with $w-w'$ gives $$ \lambda \|w-w'\|^2 = (\nabla \hat L(w') - \nabla \hat L(w))^T (w-w') + (\lambda'-\lambda) (w')^T(w-w'). $$ The first term on the right-hand side is non-positive by monotonicity of the gradient of the convex function $\hat L$, which implies $$ \lambda \|w-w'\|^2 \le (\lambda'-\lambda) (w')^T(w-w') \le (\lambda'-\lambda)\,\|w'\|\,\|w-w'\| $$ by Cauchy–Schwarz, hence $\lambda\|w-w'\| \le (\lambda'-\lambda)\|w'\|$. The right-hand side tends to zero as $|\lambda' - \lambda|\to 0$ (with $\|w'\|$ bounded by the monotonicity above), and the continuity follows.
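The last inequality is an explicit bound, $\lambda\|w - w'\| \le (\lambda' - \lambda)\|w'\|$, which can also be checked numerically (a purely illustrative sketch with the same squared-loss setup as in the earlier sketches):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def w_lambda(lam):
    # Tikhonov solution for the squared loss.
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

lam, lam2 = 0.5, 0.6                      # 0 < lambda < lambda'
w, w2 = w_lambda(lam), w_lambda(lam2)

lhs = lam * np.linalg.norm(w - w2)        # lambda * ||w - w'||
rhs = (lam2 - lam) * np.linalg.norm(w2)   # (lambda' - lambda) * ||w'||
print(lhs <= rhs)                         # True: the continuity bound holds
```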

To summarize: if we set $R(\lambda):=\|w_\lambda\|$, then $R$ is monotonically decreasing, it is bounded by $\|w_0\|$ if (2) has a solution for $\lambda=0$, and it is continuous if $\hat L$ is convex.