Solved – Ridge Regression with R

r, regression, ridge regression

I have a problem with computing the ridge regression estimator with R.

In order to calculate the regression estimator of a data set, I created three samples of size 10.

V = c(4, 3, 10, 1, 7, 10, 2, 6, 1, 4)
W = c(16, 11, 16, 13, 12, 20, 11, 20, 16, 20)
Y = c(26, 28, 27, 29, 24, 26, 22, 23, 28, 23)

Now, one can easily determine the OLS estimators with the command:

lm(Y~V+W)

which gives:

Coefficients:
(Intercept)      V            W  
27.70294     -0.06096     -0.11679  

However, I also calculated the estimators manually with the OLS formula
$$(X'X)^{-1}X'Y$$

X <- cbind(rep(1,10),V,W)

solve(t(X)%*%X)%*%t(X)%*%Y

yielding the same results.


I wanted to repeat this with the ridge regression.

library(MASS)

lm.ridge(Y~V+W,lambda=0.1)

which yields:

                  V           W 
27.68511898 -0.06088623 -0.11566871 

These results are quite similar to the OLS results, which makes sense, considering that λ is only 0.1.

Since the ridge estimator is $$(X'X + \lambda I)^{-1}X'Y,$$

I only changed the expression above to:

solve(t(X)%*%X+0.1*diag(3))%*%t(X)%*%Y

This, however, yields quite different results:

  22.86576934
V -0.09884927
W  0.19226126


Does anyone know what I am doing wrong?


Thank you very much for your quick reply!

I was aware that my text was quite difficult to read, but didn't know how to make it better. The 4 leading spaces advice is quite valuable, indeed.

What you say about removing the penalty from the intercept makes perfect sense to me. So λ is not to be multiplied by the identity matrix, but by a slightly altered identity matrix whose first row consists of zeros only.

However, I still don't quite understand why the λ should be squared. I thought the original ridge regression objective is:
$$(Y-X\beta)'(Y-X\beta) + \lambda\beta'\beta = Y'Y - 2\beta'X'Y + \beta'X'X\beta + \lambda\beta'\beta$$
Taking the first derivative with respect to β and setting it equal to 0 yields:
$$-2X'Y + 2X'X\beta + 2\lambda\beta = 0 \iff X'X\beta + \lambda\beta = X'Y$$
$$\iff (X'X + \lambda I)\beta = X'Y \iff \hat{\beta} = (X'X + \lambda I)^{-1}X'Y$$
So I thought that in ridge regression the β are squared, but the λ is not, in contrast to the lasso, where we take the modulus of the β-vector.
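
As a quick sanity check of that closed form, one can also minimise the penalised RSS numerically and compare it with the formula. A minimal sketch, using the X, Y and λ = 0.1 from above (the intercept is penalised here too, exactly as in my derivation):

# Minimise the penalised RSS directly and compare with the closed form
pen_rss <- function(b) sum((Y - X %*% b)^2) + 0.1 * sum(b^2)
b_closed <- drop(solve(t(X) %*% X + 0.1 * diag(3)) %*% t(X) %*% Y)
b_optim <- optim(c(0, 0, 0), pen_rss, method = "BFGS")$par
rbind(b_closed, b_optim)  # the two rows should (nearly) agree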

However, applying both your squared code and the non-squared code gives, in both cases, results that differ slightly from those of the built-in lm.ridge function:

lm.ridge(Y~V+W,lambda=0.1)
                 V           W 
27.68511898 -0.06088623 -0.11566871

lambda <-  cbind(0, rbind(0, diag(2)))
solve(t(X) %*% X + 0.1*lambda) %*% t(X) %*% Y
[,1]
  27.70146491
V -0.06094557
W -0.11670492

solve(t(X) %*% X + 0.1^2*lambda) %*% t(X) %*% Y
[,1]
  27.70279419
V -0.06096134
W -0.11678579

So, altogether, I am still a little bit confused.

Best Answer

You need to standardize $X$ before applying the penalty $\lambda$, and then transform the coefficients back to the scale of the original $X$. The results will then be the same as those of lm.ridge.

Something like:

# penalised cross-product matrix of the standardised predictors
r.01 <- crossprod(Xs) / (nrow(X) - 1) + diag(ncol(X)) * lambda
# invert it (via Cholesky), multiply by Xs'y, and rescale back to the original X
as.numeric(tcrossprod(chol2inv(chol(r.01)), Xs / (nrow(X) - 1)) %*% y) / sd_X

where X is the original model matrix excluding the intercept, Xs is X standardized to have unit variance, and sd_X is the vector of standard deviations of the variables in X.
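
For concreteness, here is one way that recipe might be written out with the V, W and Y from the question. It is only a sketch, and it follows lm.ridge's own scaling convention (centre X and Y, and divide each column of X by its root mean square deviation, i.e. a standard deviation with divisor n rather than n - 1):

# Worked sketch of the recipe above, using the data from the question.
# X here excludes the intercept column, as described in the answer.
X <- cbind(V, W)
Xc <- scale(X, center = TRUE, scale = FALSE)   # centred predictors
sdn <- sqrt(colMeans(Xc^2))                    # divisor-n standard deviations, as in lm.ridge
Xs <- sweep(Xc, 2, sdn, "/")                   # standardised predictors
Yc <- Y - mean(Y)                              # centred response

lambda <- 0.1
b_std <- solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, Yc))
b <- drop(b_std) / sdn                         # back to the scale of the original X
b0 <- mean(Y) - sum(colMeans(X) * b)           # unpenalised intercept
c(Intercept = b0, b)                           # compare with coef(lm.ridge(Y ~ V + W, lambda = 0.1))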