Explicit relation between the regularization parameters in Ivanov and Tikhonov regularization

convex optimization, machine learning, nonlinear optimization, optimization, regularization

Consider the class of functions $\mathcal H_\phi $ defined by
$$\mathcal H_\phi :=\left\{h:\mathbb R^d \to \mathbb R\mid h(x) = \langle w,\phi(x)\rangle+b,\ (w,b)\in\mathbb R^D\times\mathbb R\right\} $$
where $\phi : \mathbb R^d\to\mathbb R^D$ is a feature map. Assume that we are given an empirical loss function $\hat L$ and that we want to solve the following problem:
$$h \in \arg\min_{\mathcal H_\phi} \hat L(h) \tag 1$$
Solving $(1)$ can be thought of as solving a soft-margin SVM problem.

To solve $(1)$ while avoiding overfitting, a common approach is to add a penalty term on the $\ell_2$ norm of the weight vector $w$, which is called $L_2$ or Tikhonov regularization:
$$h_\lambda \in \arg\min_{\mathcal H_\phi} \hat L(h) + \frac\lambda 2\|w\|^2\tag 2 $$
Alternatively, another approach is to directly constrain the weight vector to remain within a ball of a certain radius. This is called Ivanov regularization:
$$h_R \in \arg\min_{\mathcal H_\phi} \hat L(h),\ \text{ s.t. } \|w\|^2\le R^2\tag 3$$
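For concreteness, here is a minimal numerical sketch of the two formulations (my own illustration, not from the paper), assuming a squared empirical loss and synthetic data and dropping the bias $b$: the Tikhonov problem $(2)$ then has a closed-form solution, and the Ivanov problem $(3)$ can be solved by projected gradient descent onto the ball of radius $R$. Choosing $R = \|w_\lambda\|$ recovers the Tikhonov solution, which is exactly the equivalence discussed below.

```python
import numpy as np

# Synthetic data and a squared empirical loss (an assumption for illustration;
# the question does not fix a particular loss). The bias b is omitted.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def loss_grad(w):
    # Gradient of L_hat(w) = ||X w - y||^2 / (2 n)
    return X.T @ (X @ w - y) / n

# Tikhonov (2): closed form for the squared loss.
lam = 0.5
w_tik = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Ivanov (3): projected gradient descent onto the ball {||w|| <= R}.
def ivanov(R, steps=20_000, lr=0.1):
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * loss_grad(w)
        norm = np.linalg.norm(w)
        if norm > R:
            w *= R / norm          # Euclidean projection onto the ball
    return w

R = np.linalg.norm(w_tik)          # the radius suggested by the theorem
w_iva = ivanov(R)
print(np.linalg.norm(w_tik - w_iva))   # approximately 0: the two solutions coincide
```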
In their paper "Tikhonov, Ivanov and Morozov regularization for support vector machine learning" (2016), Oneto et al. proved, under the assumption that $\hat L$ is convex with respect to $w$, that a Tikhonov-regularized SVM is equivalent to an Ivanov-regularized SVM, in the following sense:

Theorem: For any $\lambda$, there exists an $R \equiv R(\lambda)\in\mathbb R$ such that problems $(2)$ and $(3)$ have the same solutions (i.e. such that $h_\lambda = h_R$).

Regarding this theorem, I mainly have two questions:

  • How does $R(\lambda)$ vary with respect to $\lambda$? Is it at least continuous? Is it possible to get any lower bound on it in terms of $\lambda$? Intuitively, one would expect that $R(\lambda)\to\infty$ continuously as $\lambda \to 0$, but my attempts to prove this have failed, and the proof given in the paper is not constructive, so it does not help.
  • Does this result apply to more general families of functions? In particular, I am interested in deep neural networks. In that case, the convexity assumption on $\hat L$ no longer holds, and it seems unlikely that the authors' approach carries over. But, intuitively again, one would expect that adding a penalty on the Euclidean norm of the weight parameters corresponds to restricting the parameters to a ball whose radius depends on the strength of the penalty.

I will be grateful for any help or useful references you may provide.

Best Answer

Assume that for $\lambda>0$ problem (2) is solvable with solution $w_\lambda$. If $\hat L$ is convex, then the objective of (2) is strongly convex in $w$, so a solution exists and is unique.

Now take $\lambda<\lambda'$, and let $w$ and $w'$ be the associated solutions. Then by optimality $$\begin{split} \hat L(w) + \frac\lambda2\|w\|^2& \le \hat L(w') + \frac\lambda2\|w'\|^2\\ &= \hat L(w')+ \frac{\lambda'}2\|w'\|^2 + \frac{\lambda-\lambda'}2\|w'\|^2\\ &\le \hat L(w) + \frac{\lambda'}2\|w\|^2 + \frac{\lambda-\lambda'}2\|w'\|^2.\\ \end{split} $$ Rearranging shows $$ 0 \le (\lambda - \lambda')( \|w'\|^2 - \|w\|^2), $$ and since $\lambda - \lambda' < 0$ this gives $\|w'\| \le \|w\|$: the map $\lambda \mapsto \|w_\lambda\|$ is monotonically decreasing, as expected. If the problem has a solution for $\lambda=0$, then we can use the corresponding solution $w_0$ to get $\|w_\lambda\| \le \|w_0\|$ for all $\lambda$.
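As a quick numerical sanity check of this monotonicity (again a sketch under the squared-loss assumption from the earlier sketch, not part of the argument), one can evaluate $\|w_\lambda\|$ on a grid of $\lambda$ values via the ridge closed form and verify that it decreases and stays below $\|w_0\|$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def w_lambda(lam):
    # Tikhonov solution for the squared loss; lam = 0 gives the unregularized solution w_0.
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

lams = np.linspace(0.0, 5.0, 200)
norms = [np.linalg.norm(w_lambda(lam)) for lam in lams]

# lambda -> ||w_lambda|| is decreasing and bounded by ||w_0||.
assert all(a >= b for a, b in zip(norms, norms[1:]))
assert max(norms) <= np.linalg.norm(w_lambda(0.0)) + 1e-12
```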

Now set $R(\lambda):=\|w_\lambda\|$ and take any $w$ with $\|w\|\le R(\lambda)$. Then by optimality of $w_\lambda$ we have $$ \hat L(w_\lambda) + \frac\lambda 2\|w_\lambda\|^2 \le \hat L(w) + \frac\lambda 2\|w\|^2 \le \hat L(w) + \frac\lambda 2\|w_\lambda\|^2, $$ hence $\hat L(w_\lambda) \le \hat L(w)$, that is, $w_\lambda$ solves (3) with $R = R(\lambda)$.

If $\hat L$ is convex, then $\lambda \mapsto \|w_\lambda\|$ is continuous. Let me prove this for differentiable $\hat L$; the same proof works with subgradients as well. Again let $0<\lambda<\lambda'$, and let $w$ and $w'$ be the associated solutions. Then $$ \nabla \hat L(w) + \lambda w = 0 = \nabla \hat L(w') + \lambda'w'. $$ Rearranging and taking the inner product with $w-w'$ gives $$ \lambda \|w-w'\|^2 = (\nabla \hat L(w') - \nabla \hat L(w))^T (w-w') + (\lambda'-\lambda) (w')^T(w-w'). $$ The first term on the right-hand side is non-positive by monotonicity of the gradient of the convex function $\hat L$, which implies $$ \lambda \|w-w'\|^2 \le (\lambda'-\lambda) (w')^T(w-w') \le (\lambda'-\lambda)\,\|w'\|\,\|w-w'\| $$ by Cauchy–Schwarz, hence $\lambda\|w-w'\| \le (\lambda'-\lambda)\|w'\|$. The right-hand side tends to zero as $|\lambda' - \lambda|\to 0$ (with $\|w'\|$ bounded by the monotonicity above), and the continuity follows.
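The last inequality is an explicit bound, $\lambda\|w - w'\| \le (\lambda' - \lambda)\|w'\|$, which can also be checked numerically (a purely illustrative sketch with the same squared-loss setup as in the earlier sketches):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def w_lambda(lam):
    # Tikhonov solution for the squared loss.
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

lam, lam2 = 0.5, 0.6                      # 0 < lambda < lambda'
w, w2 = w_lambda(lam), w_lambda(lam2)

lhs = lam * np.linalg.norm(w - w2)        # lambda * ||w - w'||
rhs = (lam2 - lam) * np.linalg.norm(w2)   # (lambda' - lambda) * ||w'||
print(lhs <= rhs)                         # True: the continuity bound holds
```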

To summarize: if we set $R(\lambda):=\|w_\lambda\|$, then $R$ is monotonically decreasing, it is bounded by $\|w_0\|$ if (2) has a solution for $\lambda=0$, and it is continuous if $\hat L$ is convex.