Ridge Regression – Applying Ridge Regression for an Underdetermined System of Equations

least squares, regression, regularization, ridge regression, underdetermined

When $y = X\beta + e$, the least squares problem which imposes a spherical restriction $\delta$ on the value of $\beta$ can be written as
\begin{equation}
\begin{aligned}
&\operatorname{min}\ \| y - X\beta \|^2_2 \\
&\operatorname{s.t.}\ \ \|\beta\|^2_2 \le \delta^2
\end{aligned}
\end{equation}
for an overdetermined system. $\|\cdot\|_2$ is the Euclidean norm of a vector.

The corresponding solution to $\beta$ is given by
\begin{equation}
\hat{\beta} = \left(X^TX + \lambda I\right)^{-1}X^T y \ ,
\end{equation}
which can be derived from the method of Lagrange multipliers ($\lambda$ is the multiplier):
\begin{equation}
\mathcal{L}(\beta,\lambda) = \|y-X\beta\|^2_2 + \lambda(\|\beta\|^2_2 - \delta^2)
\end{equation}
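
Setting the gradient of the Lagrangian with respect to $\beta$ to zero gives the step that leads from the Lagrangian to the closed form:
\begin{equation}
-2X^T\left(y - X\beta\right) + 2\lambda\beta = 0 \quad\Longrightarrow\quad \left(X^TX + \lambda I\right)\hat{\beta} = X^T y \ .
\end{equation}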

I understand that there is a property that
\begin{equation}
\left(X^TX + \lambda I\right)^{-1}X^T = X^T\left(XX^T + \lambda I\right)^{-1} \ .
\end{equation}
The right-hand side resembles the pseudo-inverse of the regressor matrix $X$ in the underdetermined case (with the added regularization parameter $\lambda$). Does this mean that the same expression can be used to approximate $\beta$ for the underdetermined case? Or is there a separate derivation for the underdetermined case, where the spherical restriction is redundant with the objective function (minimum norm of $\beta$):

\begin{equation}
\begin{aligned}
&\operatorname{min}\ \| \beta \|_2 \\
&\operatorname{s.t.}\ X\beta = y \ .
\end{aligned}
\end{equation}
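
For concreteness, here is a small NumPy sketch of what I am asking (the dimensions and data are a made-up toy example): with a wide $X$, ridge with a small $\lambda$ appears to approach the minimum-norm solution given by the pseudo-inverse.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 5, 20                      # underdetermined: fewer equations than unknowns
    X = rng.standard_normal((m, n))
    y = rng.standard_normal(m)

    lam = 1e-8                        # very small ridge penalty
    # Ridge solution in the form X^T (X X^T + lam I)^{-1} y
    beta_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(m), y)

    # Minimum-norm solution of X beta = y via the pseudo-inverse
    beta_minnorm = np.linalg.pinv(X) @ y

    print(np.allclose(beta_ridge, beta_minnorm, atol=1e-6))   # True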

Best Answer

Starting with the formulation of the ridge regression problem as

$ \min \| X\beta -y \|_{2}^{2} + \lambda \| \beta \|_{2}^{2} $

you can write the problem as

$ \min \| A\beta - b \|_{2}^{2} $

where

$ A=\left[ \begin{array}{c} X \\ \sqrt{\lambda} I \end{array} \right] $

and

$b=\left[ \begin{array}{c} y \\ 0 \end{array} \right]. $

The matrix $A$ has full column rank because of the $\sqrt{\lambda}I$ part. Thus the least squares problem has a unique solution

$\hat{\beta}=(A^{T}A)^{-1}A^{T}b$

Writing this out in terms of $X$ and $y$, and simplifying lots of 0's out, we get

$\hat{\beta}=(X^{T}X+\lambda I)^{-1}X^{T}y$

Nothing in this derivation depends on whether $X$ has more rows or more columns, or even on whether $X$ has full rank. This formula is thus applicable to the underdetermined case.
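
As a quick numerical check (a minimal NumPy sketch with made-up toy dimensions, using a wide $X$ to stand in for the underdetermined case), solving the stacked least squares problem directly agrees with the closed-form expression:

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 5, 20                      # wide X: more unknowns than equations
    X = rng.standard_normal((m, n))
    y = rng.standard_normal(m)
    lam = 0.5

    # Stacked system: A = [X; sqrt(lam) I], b = [y; 0]
    A = np.vstack([X, np.sqrt(lam) * np.eye(n)])
    b = np.concatenate([y, np.zeros(n)])
    beta_stacked, *_ = np.linalg.lstsq(A, b, rcond=None)

    # Closed form (X^T X + lam I)^{-1} X^T y
    beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

    print(np.allclose(beta_stacked, beta_closed))   # True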

It is an algebraic fact that for $\lambda>0$,

$(X^{T}X+\lambda I)^{-1}X^{T}=X^{T}(XX^{T}+\lambda I)^{-1} $
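
One way to see this: for $\lambda>0$ both $X^{T}X+\lambda I$ and $XX^{T}+\lambda I$ are invertible, and

$ X^{T}\left( XX^{T}+\lambda I \right) = \left( X^{T}X+\lambda I \right) X^{T}. $

Multiplying on the left by $(X^{T}X+\lambda I)^{-1}$ and on the right by $(XX^{T}+\lambda I)^{-1}$ gives the identity above.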

Thus we also have the option of using

$\hat{\beta}=X^{T}(XX^{T}+\lambda I)^{-1}y$.

To answer your specific questions:

  1. Yes, both formulas work for the underdetermined case as well as the overdetermined case. They also work if $\mbox{rank}(X)$ is less than the minimum of the number of rows and columns of $X$. The second version can be more efficient for underdetermined problems because $XX^{T}$ is smaller than $X^{T}X$ in that case (see the sketch after this list).

  2. I'm not aware of any derivation of the alternative version of the formula that starts with some other damped least squares problem and uses the normal equations. In any case, you can derive it in a straightforward fashion with a bit of algebra.
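
Here is a minimal NumPy sketch of point 1 (toy dimensions made up for illustration): for a wide $X$ the two formulas agree, and the second only requires solving an $m \times m$ system rather than an $n \times n$ one.

    import numpy as np

    rng = np.random.default_rng(2)
    m, n = 50, 2000                   # underdetermined: m << n
    X = rng.standard_normal((m, n))
    y = rng.standard_normal(m)
    lam = 0.1

    # First form: solve an n x n system
    beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

    # Second form: solve an m x m system, then map back with X^T
    beta_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(m), y)

    print(np.allclose(beta_primal, beta_dual))   # True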

It's possible that you're thinking of the ridge regression problem in the form

$\min \| \beta \|_{2}^{2} $

subject to

$\| X\beta-y \|_{2}^{2} \leq \epsilon.$

However, this version of the ridge regression problem simply leads to the same damped least squares problem $\min \| X\beta -y \|_{2}^{2} + \lambda \| \beta \|_{2}^{2}$.
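
Filling in that step: the Lagrangian of this version is

$ \| \beta \|_{2}^{2} + \mu \left( \| X\beta-y \|_{2}^{2} - \epsilon \right), $

and for a multiplier $\mu>0$, dividing by $\mu$ and dropping the constant $-\epsilon$ leaves $\| X\beta-y \|_{2}^{2} + \frac{1}{\mu}\| \beta \|_{2}^{2}$, i.e. the damped problem with $\lambda = 1/\mu$.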
