Does gradient descent for linear regression select the minimal norm solution?

convex optimization, linear regression, machine learning, optimization, regression

I was told that Gradient Descent finds the weights of smallest norm.

This is what I understood in the linear regression setting:

$f_w(x)=w^\top x$ are the linear functions $ \mathbb{R}^d \rightarrow \mathbb{R} $.

$ \mathcal{D}=\{(x_i,y_i) \in \mathbb{R}^d\times\mathbb{R} \mid i=1,…,n\} $ is a dataset

Then I choose some loss function $L$ to get a cost function
$ C_0(w)=\frac{1}{n} \sum_i L(f_w(x_i),y_i) $

Finally, I define a regularized cost $ C_\lambda(w)=C_0(w)+\frac{\lambda}{2n}||w||^2 $
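For concreteness, here is a minimal numpy sketch of this setup (assuming the squared loss $L(\hat y, y)=\tfrac12(\hat y-y)^2$; the rows of `X` are the $x_i$, and the names `C0`/`C_lam` are just illustrative):

```python
import numpy as np

def C0(w, X, y):
    # unregularized cost: average loss over the dataset, here with squared loss
    return 0.5 * np.mean((X @ w - y) ** 2)

def C_lam(w, X, y, lam):
    # regularized cost: C0 plus the (lambda / 2n) * ||w||^2 penalty
    n = len(y)
    return C0(w, X, y) + lam / (2 * n) * np.dot(w, w)
```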

Then there must be some sort of equivalence between the following (and it looks like it should be a classical result):

  1. minimizing $C_\lambda(w)$ for an opportune $ \lambda \ge 0 $
  2. minimizing $C_0(w)$ with Gradient Descent

My questions:

  • Can somebody clarify this fact a bit, by providing some hypotheses (e.g. on $L$), some insight, or a reference…
  • Does this apply in a similar way to the linear classification setting ($y_i\in\{-1,+1\}$)? In that case, I would expect some equivalence between:
  1. maximizing $M(w)= \min_{i=1…n} y_i \langle w, x_i \rangle $ over $||w||=1$
  2. minimizing $||w||$ over $ y_i \langle w, x_i \rangle \ge 1 $
  3. possibly also points 1. and 2. above (those regarding regression)

Best Answer

I will prove that when we initialize the weights at zero and apply Gradient Descent in the limit of infinitesimal step size, the solution is guaranteed to converge to the minimal norm solution.

Some notation

Let the loss function be given as

$$(1) \qquad L_\lambda(w) = \tfrac{1}{2N}\|Xw-y\|_2^2 + \tfrac{1}{2}\lambda\|w\|_2^2$$

With gradient $\nabla_w L_\lambda = (\lambda I + \tfrac{1}{N}X^TX)w - \tfrac{1}{N}X^Ty$ and Hessian $\nabla^2_w L_\lambda = \lambda I + \tfrac{1}{N}X^TX$
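As a quick sanity check of these formulas, here is a small numpy sketch (made-up data, illustrative names) that compares the stated gradient against a central finite-difference approximation of (1):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 5
X, y = rng.normal(size=(N, d)), rng.normal(size=N)
lam = 0.1
w = rng.normal(size=d)

def loss(w):
    # (1): 1/(2N) ||Xw - y||^2 + (lambda/2) ||w||^2
    return 0.5 / N * np.sum((X @ w - y) ** 2) + 0.5 * lam * np.dot(w, w)

# gradient from the formula above
grad = (lam * np.eye(d) + X.T @ X / N) @ w - X.T @ y / N

# central finite differences along each coordinate direction
eps = 1e-6
fd = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
               for e in np.eye(d)])
print(np.max(np.abs(grad - fd)))  # tiny (finite-difference error only)
```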

Note that $\nabla^2_w L_\lambda$ is positive semi-definite for all $\lambda \ge 0$ and even positive definite if $\lambda>0$. Consequently, $L_{\lambda>0}$ is strongly convex and has a unique global minimum, whereas $L_0$ is only guaranteed to be convex and not strongly convex, and so can potentially have multiple global minima.

This does indeed happen when the columns of $X$ are not linearly independent: To any optimal $w$, we can add an element of $\ker(X)$ and get another optimal $w'$.
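A quick numerical illustration of this (a sketch with made-up data; the rank-deficient `X` is built by duplicating a column):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 30
X = rng.normal(size=(N, 3))
X = np.hstack([X, X[:, [0]]])        # duplicate a column -> linearly dependent columns
y = rng.normal(size=N)

w_opt = np.linalg.pinv(X) @ y        # one minimizer of L_0
v = np.array([1.0, 0.0, 0.0, -1.0])  # element of ker(X): X @ v = 0

L0 = lambda w: 0.5 / N * np.sum((X @ w - y) ** 2)
print(np.allclose(X @ v, 0))                  # True: v lies in the kernel
print(np.isclose(L0(w_opt), L0(w_opt + v)))   # True: same (optimal) loss, different w
```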

Claim 1: For $\lambda>0$, let $\hat w(\lambda) = \arg\min_w L_\lambda(w)$, which is unique by strong convexity. Then

$$(2)\qquad \arg\min_w L_0(w) = \hat w_0 + \ker(X) \qquad\text{where}\qquad \hat w_0 =\lim\limits_{\lambda\searrow 0}\hat w(\lambda)=X^+y $$

Here $X^+$ is the pseudoinverse. We will refer to $\hat w_0$ as the unique minimal norm solution among solutions of the unregularized problem $L_0$. (One can also show $\hat w_0 = \underset{w\in\arg\min(L_0)}{\arg\min}\|w\|_2$)

Proof: Exercise.
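Rather than proving it here, a numerical sketch of Claim 1 on a rank-deficient design (a duplicated column, as in the snippet above): the ridge solutions $\hat w(\lambda)$ approach $X^+y$ as $\lambda\searrow 0$, and $X^+y$ has smaller norm than the other minimizers of $L_0$.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 30
X = rng.normal(size=(N, 3))
X = np.hstack([X, X[:, [0]]])          # rank-deficient design
y = rng.normal(size=N)
d = X.shape[1]

w_pinv = np.linalg.pinv(X) @ y         # claimed minimal norm solution hat{w}_0

for lam in [1e-1, 1e-3, 1e-6]:
    # hat{w}(lambda) solves (lambda I + X^T X / N) w = X^T y / N
    w_ridge = np.linalg.solve(lam * np.eye(d) + X.T @ X / N, X.T @ y / N)
    print(lam, np.linalg.norm(w_ridge - w_pinv))   # -> 0 as lambda -> 0

v = np.array([1.0, 0.0, 0.0, -1.0])    # kernel direction, X @ v = 0
print(np.linalg.norm(w_pinv), np.linalg.norm(w_pinv + v))  # the first norm is smaller
```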


Connection of Gradient Descent to Ordinary Differential Equations

Now, a hypothesis we could make is that "under normal circumstances" Gradient Descent (GD) should converge to the minimal norm solution. The question is what we mean by "normal circumstances". Well, if we start GD at a point that is already a global minimum other than the minimal norm minimum, then of course GD will never move away from that point. Another thing that can happen is non-convergence due to a step size that is too large. To avoid these issues, we will consider GD with infinitesimal step size and a fixed starting location at the origin. In this case, the iterative scheme is equivalent to solving an ODE.

Claim 2: GD on $L_\lambda$ with starting point $w_0$ and infinitesimal step size is equivalent to the ODE

$$(3) \qquad \begin{aligned}&\dot w = -\nabla L_{\lambda}(w) \\ &w(0) = w_0\end{aligned} \qquad \nabla L_{\lambda}(w)= \big(\lambda I+\tfrac{1}{N}X^TX\big)w-\tfrac{1}{N}X^Ty $$

Proof: Compare GD applied to (1) with the explicit Euler method applied to (3)

This is actually a general fact for any differentiable loss function / model.
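Claims 2 and 3 can also be checked numerically. A small numpy/scipy sketch (made-up data, illustrative step size $\eta$): GD with step size $\eta$ is exactly the explicit Euler discretization of (3), so for small $\eta$ the GD iterate after $k$ steps stays close to the ODE solution at time $t=k\eta$.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(3)
N, d = 30, 4
X, y = rng.normal(size=(N, d)), rng.normal(size=N)
A, b = X.T @ X / N, X.T @ y / N      # gradient of L_0 is A w - b

eta, k = 1e-3, 5000                  # step size and number of GD steps
w = np.zeros(d)
for _ in range(k):
    w = w - eta * (A @ w - b)        # gradient descent step = explicit Euler step

# integrate the gradient flow dw/dt = -(A w - b), w(0) = 0, up to t = k * eta
sol = solve_ivp(lambda t, w: -(A @ w - b), (0, k * eta), np.zeros(d),
                rtol=1e-10, atol=1e-12)
print(np.linalg.norm(w - sol.y[:, -1]))   # small, and shrinks further as eta -> 0
```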

Claim 3: The solution of the ODE (3) with initial condition $w_0=0$ converges to $\hat w(\lambda)$ for all $\lambda\ge 0$ as $t\to\infty$. (We will only consider the interesting case $\lambda=0$.)

Proof: Recall that a linear ODE $\dot x = Ax+b, x(t_0)=x_0$ with parameters $\theta=(t_0, x_0, A, b)$ has the analytic solution

$$x^*(t, \theta) = e^{A(t-t_0)}x_0 + \frac{e^{A(t-t_0)}- I}{A}b$$

Note that $\frac{e^{A(t-t_0)}- I}{A}$ is only shorthand for the power series $\sum_{n=0}^{\infty} \frac{A^{n}(t-t_0)^{n+1}}{(n+1)!}=\int_{0}^{t-t_0} e^{As}\,ds$, which exists and converges even when $A$ is singular. Then, with $t_0=0$, $w_0=0$, $A=-\tfrac{1}{N}X^TX$, $b=\tfrac{1}{N}X^Ty$:

$$ w^*(t) = \frac{I - e^{-\frac{1}{N}X^TX t}}{X^TX}X^Ty $$

So, we are left with showing that $\lim_{t\to\infty} \frac{I-e^{-\frac{1}{N}X^TX t}}{X^TX} X^T = X^+$, which we can do by showing that the limit satisfies the four Moore-Penrose conditions

$$\begin{aligned} &1.&AA^+A&=A & &3.&(AA^+)^T &= AA^+ \\ &2.&A^+AA^+&=A^+ & &4.&(A^+A)^T &=A^+A \end{aligned}$$

Indeed, for a symmetric positive semi-definite $A$ we already have $\lim_{t\to\infty} \frac{I-e^{-At}}{A}=A^+$ on $\operatorname{range}(A)$ (on $\ker(A)$ the expression grows like $t$, but that part is annihilated here because $X^Ty \in \operatorname{range}(X^T)=\ker(X^TX)^\perp$). We demonstrate this for the first MP condition. Note that the matrix exponential satisfies $e^{U^{-1} A U} = U^{-1} e^A U$ and $e^{\operatorname{diag}(v)} = \operatorname{diag}(\exp(v))$ element-wise, hence:

$$\begin{aligned} A \frac{I-e^{-At}}{A} A &= A (I - e^{-At}) = U^{-1} \big(\underbrace{\Lambda (I - e^{-\Lambda t} )}_{\to \Lambda}\big) U \to A \end{aligned}$$

And $\tfrac{1}{N}X^TX$ is obviously symmetric with non-negative real eigenvalues, so

$$\begin{aligned} \lim_{t\to\infty} w^*(t) = \lim_{t\to\infty}\frac{I - e^{-\frac{1}{N}X^TX t}}{X^TX}X^Ty = (\tfrac{1}{N}X^TX)^+ \tfrac{1}{N}X^Ty = X^+y = \hat w_0 \end{aligned}$$

Q.E.D.
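To make this concrete, here is a numerical sketch of the result (illustrative data; $w^*(t)$ is evaluated in the eigenbasis of $X^TX$, where the zero-eigenvalue directions can be dropped because $X^Ty$ has no component in $\ker(X)$): for large $t$, $w^*(t)$ matches $X^+y$.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 30
X = rng.normal(size=(N, 3))
X = np.hstack([X, X[:, [0]]])                 # rank-deficient, so X^T X is singular
y = rng.normal(size=N)

lam_eig, U = np.linalg.eigh(X.T @ X)          # X^T X = U diag(lam_eig) U^T

def w_star(t):
    # closed form w*(t) = (I - exp(-X^T X t / N)) / (X^T X)  X^T y in the eigenbasis;
    # zero eigenvalues contribute nothing since X^T y is orthogonal to ker(X)
    coef = np.where(lam_eig > 1e-10,
                    (1 - np.exp(-lam_eig * t / N)) / np.maximum(lam_eig, 1e-10),
                    0.0)
    return U @ (coef * (U.T @ (X.T @ y)))

w_pinv = np.linalg.pinv(X) @ y
for t in [1.0, 10.0, 1000.0]:
    print(t, np.linalg.norm(w_star(t) - w_pinv))   # -> 0 as t grows
```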

Corollary: If we start at a general point $w_0= v + v^\perp$ where $v\in\ker(X)$ and $v^\perp\in\ker(X)^\perp$, then GD converges to $\hat w_0 + v$. In particular, if $X$ has full column rank, then $\ker(X)=\{0\}$ and we still converge to the minimal norm solution from any starting point.

Proof: $e^{-X^TXt} = U e^{-\Lambda t} U^T$, thus for any eigenvalue, eigenvector pair $(\lambda_i, u_i)$ of $X^TX$ we have

$$\lim_{t\to\infty}e^{-X^TXt} u_i = \lim_{t\to\infty}e^{-\lambda_i t}u_i=\begin{cases}0 &: \lambda_i>0 \\ u_i &: \lambda_i=0\end{cases}$$

Consequently, $\lim_{t\to\infty}e^{-X^TXt}=\operatorname{proj}_{\ker(X)}$.
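A numerical sketch of the Corollary (again with made-up, rank-deficient data): running plain GD from a random $w_0$ converges to $X^+y + \operatorname{proj}_{\ker(X)}w_0$, not to the minimal norm solution.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 30
X = rng.normal(size=(N, 3))
X = np.hstack([X, X[:, [0]]])            # rank-deficient design
y = rng.normal(size=N)
d = X.shape[1]

A, b = X.T @ X / N, X.T @ y / N
w0 = rng.normal(size=d)                   # general starting point

w = w0.copy()
for _ in range(200_000):                  # small step size, many steps
    w = w - 1e-3 * (A @ w - b)

# projection onto ker(X), built from eigenvectors of X^T X with (numerically) zero eigenvalue
lam_eig, U = np.linalg.eigh(X.T @ X)
U0 = U[:, lam_eig < 1e-10]
P_ker = U0 @ U0.T

w_pred = np.linalg.pinv(X) @ y + P_ker @ w0
print(np.linalg.norm(w - w_pred))         # small: GD keeps the kernel component of w0
```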


Some final remarks: The connection between gradient-based optimizers and ODEs is actually pretty deep; see for example A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights by Su, Boyd and Candès.
