Rewriting the Ridge Regression coefficients

machine-learning, regression, ridge-regression, weights

In Ridge Regression we try to find the minimum of the following loss function:

$$\min_w\mathcal{L}_{\lambda}(w,S)=\min_w\left(\lambda\|w\|^2+\sum^l_{i=1}(y_i-g(x_i))^2\right)$$

Where:

  • $\lambda$ is a positive number that defines the relative trade-off between the norm penalty and the squared loss
  • $\mathcal{L}$ is the loss function
  • $w\in\mathbb{R}^n$ is the vector of weights
  • $g(x_i)=\langle w,x_i\rangle$ is the predicted value for observation $x_i$
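
For concreteness, here is a minimal NumPy sketch of this loss (the function name and toy shapes are my own, and it assumes the linear model $g(x_i)=\langle w,x_i\rangle$ from above):

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """L_lambda(w, S) = lam * ||w||^2 + sum_i (y_i - <w, x_i>)^2"""
    residuals = y - X @ w              # y_i - g(x_i) for every training point
    return lam * w @ w + residuals @ residuals
```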

Taking the derivative of the loss function with respect to the parameters and setting it equal to zero, we obtain the equations (*)

$$X'Xw+\lambda w=(X'X+\lambda I_n)w=X'y$$

Where:

  • $I_n$ is the $n\times n$ identity matrix
  • $X\in \mathbb{R}^{l\times n}$ is the data matrix
  • $X'$ is the transpose of $X$

The solution to the above equation is

$$w=(X'X+\lambda I_n)^{-1}X'y$$
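
As a sanity check, this solution can be computed and verified against equations (*) with a small NumPy sketch (the toy data and variable names are my own; I use a linear solve rather than forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
l, n, lam = 50, 3, 0.1
X = rng.normal(size=(l, n))                  # data matrix, l observations by n features
y = rng.normal(size=l)                       # targets

# w = (X'X + lambda * I_n)^{-1} X'y
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# the normal equations (*) hold: X'Xw + lambda*w = X'y
print(np.allclose(X.T @ X @ w + lam * w, X.T @ y))   # True
```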

Now, my book says that we can rewrite equations (*) in terms of $w$:

$$w=\lambda^{-1}X'(y-Xw)=X'\alpha$$

showing that $w$ can be written as a linear combination of the training points, $w=\sum^l_{i=1}\alpha_ix_i$, with $\alpha=\lambda^{-1}(y-Xw)$.

I have a hard time understanding how $w=\lambda^{-1}X'(y-Xw)$ is derived. Can someone show this algebraically?

Best Answer

Just:

$X'y = X'Xw + \lambda w $

$X'y - X'Xw = \lambda w $

$X'(y - Xw) = \lambda w $

$w = \lambda^{-1}X'(y - Xw) $

$w = X'\alpha $ with $\alpha=\lambda^{-1}(y - Xw) $
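
For completeness, the identity is easy to confirm numerically; a minimal sketch with made-up data (the names and toy setup are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
l, n, lam = 50, 3, 0.1
X = rng.normal(size=(l, n))
y = rng.normal(size=l)

# primal solution: w = (X'X + lambda * I_n)^{-1} X'y
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# alpha = lambda^{-1} (y - Xw), and then w = X'alpha = sum_i alpha_i x_i
alpha = (y - X @ w) / lam
print(np.allclose(w, X.T @ alpha))   # True
```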
