Regularized optimization: adding a vanishingly small penalty term does not change the solution set

karush-kuhn-tucker, lagrange-multiplier, nonlinear-optimization, optimization, regularization

Say I am trying to minimize a differentiable function $R: \Theta\to\mathbb R^+$, where $\Theta\subseteq\mathbb R^p $ is a compact subset. Now, for $\lambda\ge0 $, I define the $ \lambda$-regularized objective as
$$ \min_{\theta\in\Theta}\ \underbrace{R(\theta) + \frac{\lambda}{2}|\theta|^2}_{R_\lambda(\theta)} $$
where $|\cdot|$ is the standard Euclidean norm. $R$ is not convex, so the argmin of $R_\lambda$ need not reduce to a single point; in general it is a subset of $\Theta$.
I have the following question:

does there exist an $\varepsilon > 0$ such that

$$\arg\min_{\theta\in\Theta} R_\lambda(\theta) =\arg\min_{\theta\in\Theta} R(\theta) \ \text{ for all } \lambda\in[0,\varepsilon] \ ? \tag1$$

(Note that these argmins need not be singletons.) In other words, is adding a sufficiently small penalty "negligible" with respect to the objective function?

Beyond empirical evidence and intuition, a reason why I expect this to be true is that there is a provable "equivalence" between the penalized and the constrained formulations of regularization.
That is, based on a Lagrangian duality argument, one can show that minimizing the $\lambda$-regularized objective is equivalent to minimizing the non-regularized objective over $\Theta \cap B_{r(\lambda)}(0)$, where $B_{r(\lambda)}(0)$ denotes the closed ball of radius $r(\lambda)$ centered at the origin, and $r(\lambda)$ is a non-increasing function of $\lambda$. (See also here and here.)
From this observation, we can see that adding the penalty is effectively equivalent to constraining the parameters to remain within a certain ball, whose radius decreases as the penalty term increases. Hence, for small values of $\lambda>0$, the constraint should not be active and the solution set should remain unchanged.
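As a quick sanity check (not a proof), this correspondence can be observed numerically: a minimizer $\theta_\lambda$ of $R_\lambda$ over $\Theta$ also minimizes $R$ over $\Theta \cap \{|\theta| \le |\theta_\lambda|\}$. Below is a minimal sketch on a toy non-convex one-dimensional example; the particular $R$, $\Theta$ and $\lambda$ are arbitrary choices made only for illustration.

```python
# Minimal numerical sketch (not a proof) of the penalty <-> constraint
# correspondence; R, Theta = [-2, 2] and lambda are arbitrary illustrative choices.
import numpy as np

theta = np.linspace(-2.0, 2.0, 200_001)        # grid over Theta = [-2, 2]
R = np.sin(3.0 * theta) + 0.1 * theta**2       # a toy non-convex R

lam = 0.3
R_lam = R + 0.5 * lam * theta**2               # lambda-regularized objective

theta_lam = theta[np.argmin(R_lam)]            # minimizer of R_lam over Theta
r = abs(theta_lam)                             # induced radius r(lambda) = |theta_lam|

in_ball = np.abs(theta) <= r                   # Theta ∩ closed ball of radius r
theta_ball = theta[in_ball][np.argmin(R[in_ball])]  # minimizer of R over that ball

print(theta_lam, theta_ball)                   # the two minimizers coincide on the grid
```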

Based on this idea, I managed to piece together a rough argument for why $(1)$ is true via Lagrangian duality. However, it is quite handwavy and I am not 100% sure it is correct. Does someone see how to rigorously prove/disprove the statement?

Alternatively, I would be happy with a rigorous proof of the following weaker statement:

for all $t>0$, there exists $\varepsilon >0$ such that
$$\theta_\lambda\in\arg\min_{\theta \in \Theta} R_\lambda (\theta)\implies R(\theta_\lambda) \le \min_{\theta\in \Theta} R(\theta) + t\ \text{ for all } \lambda\in[0,\varepsilon] \tag2 $$

Any hint or reference will be welcome. Thanks in advance!

Best Answer

Conjecture (1) is false. Take $p = 1$ and $\Theta = [0,1] \subset \mathbb{R}$. Take $R(\theta) = 0$. Then the set of minimizers of $R$ is $[0,1]$ while, for any $\lambda > 0$, the set of minimizers of $R_\lambda$ is the singleton $\{ 0 \}$.

Obviously, this is a pathological example, but you can make it less artificial. The key point is that very "flat" regions of $R$ will prevent your property from holding. For example, with $R(\theta) := (1-\theta)^4$, still on $\Theta = [0,1]$, the only minimizer of $R$ is $\theta = 1$, while for any $\lambda > 0$ there is a unique minimizer of $R_\lambda$ and it is $< 1$.
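Here is a quick numerical check of this second example (a sketch only; the grid resolution and the values of $\lambda$ are arbitrary):

```python
# Numerical check of the counterexample R(theta) = (1 - theta)^4 on Theta = [0, 1]:
# for every lambda > 0 the minimizer of R_lambda is strictly below 1.
import numpy as np

theta = np.linspace(0.0, 1.0, 1_000_001)       # fine grid over Theta = [0, 1]
R = (1.0 - theta) ** 4

for lam in [1e-1, 1e-2, 1e-3, 1e-4]:
    R_lam = R + 0.5 * lam * theta**2
    theta_lam = theta[np.argmin(R_lam)]
    print(f"lambda = {lam:.0e}   argmin R_lambda ~ {theta_lam:.6f}")
# The argmin of R_lambda approaches 1 as lambda -> 0 but never equals it,
# so the two solution sets differ for every lambda > 0.
```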

Conjecture (2) is correct. Your weaker statement is indeed easy to prove. Let $\bar{\theta}$ be a minimizer of $R$. For any minimizer $\theta_\lambda$ of $R_\lambda$,
$$ R_\lambda(\theta_\lambda) \leq R_\lambda(\bar{\theta}) = R(\bar{\theta}) + \frac{\lambda}{2} |\bar{\theta}|^2. $$
Hence
$$ R(\theta_\lambda) = R_\lambda(\theta_\lambda) - \frac{\lambda}{2}|\theta_\lambda|^2 \leq R_\lambda(\theta_\lambda) \leq R(\bar{\theta}) + \frac{\lambda}{2} |\bar{\theta}|^2 = \min_{\Theta} R + \frac{\lambda}{2} |\bar{\theta}|^2. $$
Therefore, choosing $\varepsilon$ such that $\frac{\varepsilon}{2} M^2 \leq t$, where $M>0$ is such that $\Theta \subset B(0,M)$ (such an $M$ exists since $\Theta$ is compact, hence bounded), yields your property $(2)$.
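A quick numerical sanity check of this bound, reusing the quartic example above (there $\min R = 0$ and one may take $M = 1$):

```python
# Check that R(theta_lambda) <= min R + (lambda / 2) * M^2 on the quartic example,
# with Theta = [0, 1] contained in B(0, M) for M = 1.
import numpy as np

theta = np.linspace(0.0, 1.0, 1_000_001)
R = (1.0 - theta) ** 4
M = 1.0

for lam in [1e-1, 1e-2, 1e-3]:
    theta_lam = theta[np.argmin(R + 0.5 * lam * theta**2)]
    print(f"lambda = {lam:.0e}   R(theta_lam) = {(1 - theta_lam)**4:.3e}   "
          f"bound = {R.min() + 0.5 * lam * M**2:.3e}")
# R(theta_lam) stays below the bound, so epsilon = 2*t / M^2 works in (2).
```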