Solved – Naive Ridge Regression in R

Tags: r, regression, ridge-regression

I'm trying to learn some basic Machine Learning and some basic R. I have made a very naive implementation of $L_2$ regularization in R based on the formula:

$\hat w^{ridge} = (X^TX +\lambda I)^{-1} X^T y$

My code looks like this:

fitRidge <- function(X, y, lambda) {
  # Add intercept column to X:
  X <- cbind(1, X)
  # Calculate penalty matrix:
  lambda.diag <- lambda * diag(dim(X)[2])
  # Apply formula for ridge regression:
  return(solve(t(X) %*% X + lambda.diag) %*% t(X) %*% y)
}

Note that I'm not yet trying to find an optimal $\lambda$; I'm simply estimating $\hat w^{ridge}$ for a given $\lambda$. However, something seems off. When I enter $\lambda = 0$, I get the expected OLS result. I checked this by applying lm.ridge(lambda = 0) to the same dataset, and it gives me the same coefficients. However, when I input any other penalty, like $\lambda=2$ or $\lambda=5$, my coefficients and the coefficients given by lm.ridge disagree wildly. I tried looking at the implementation of lm.ridge but I couldn't work out what it does (and therefore what it does differently).

Could anyone explain why there is a difference between my results and the results from lm.ridge? Am I doing something wrong in my code? I've tried playing around with scale() but couldn't find an answer there.

EDIT:

To see what happens, run the following:

library(car)   # for the Prestige data
library(MASS)  # for lm.ridge
X.prestige <- as.matrix.data.frame(Prestige[,c(1,2,3,5)])
y.prestige <- Prestige[,4]

fitRidge(X.prestige, y.prestige, 0)
coef(lm.ridge(formula = prestige~education+income+women+census, data = Prestige, lambda = 0))
fitRidge(X.prestige, y.prestige, 2)
coef(lm.ridge(formula = prestige~education+income+women+census, data = Prestige, lambda = 2))

EDIT2:

Okay, so based on the responses below, I've gotten a somewhat clearer understanding of the problem. I've also closely re-read the section about RR in The Elements of Statistical Learning by Hastie, Tibshirani and Friedman, where I discovered that the intercept is often estimated simply as the mean of the response. It seems that many online sources on RR are overly vague. I suspect many writers have never implemented RR themselves and have missed some important points, since many of them leave out 3 important facts:

  1. The intercept is not penalized in the normal case; the formula above only applies to the other coefficients.
  2. RR is not equivariant under scaling, i.e. different scalings of the predictors give different results even for the same data (see the quick check after this list).
  3. Following from 1, how one actually estimates the intercept.
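
For instance, here is a quick check of point 2, reusing the original fitRidge and the Prestige objects from the first edit (the division by 1e6 is deliberately extreme, just to make the effect easy to see):

# Same data, but with income expressed in a (deliberately extreme) smaller unit:
X.rescaled <- X.prestige
X.rescaled[, "income"] <- X.rescaled[, "income"] / 1e6

fitRidge(X.prestige, y.prestige, 2)   # income's raw coefficient is tiny, so it is barely penalized
fitRidge(X.rescaled, y.prestige, 2)   # same lambda now shrinks income hard,
                                      # and the other coefficients shift too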

I tried altering my function accordingly:

fitRidge <- function(X, Y, lambda) {
  # Standardize X and Y
  X <- scale(X)
  Y <- scale(Y)
  # Generate penalty matrix
  penalties <- lambda * diag(ncol(X))
  # Estimate intercept
  inter <- mean(Y)
  # Solve ridge system
  coeff <- solve(t(X) %*% X + penalties, t(X) %*% Y)
  # Create standardized weight vector
  wz <- c(inter, coeff)
  return(wz)
}

I still don't get results equivalent to lm.ridge though, but it might just be a question of translating the formula back into the original scales. However, I can't seem to work out how to do this. I thought it would just entail multiplying by the standard deviation of the response and adding the mean, as usual for standard scores, but either my function is still wrong or rescaling is more complex than I realize.

Any advice?

Best Answer

First, very simply, I don't think your call to solve looks right; this is what I would expect:

solve(t(X) %*% X + lambda.diag, t(X) %*% y)

Your code explicitly calculates a matrix inverse and then multiplies by it. This is mathematically correct, but computationally it is the wrong approach: forming the inverse is more work and less numerically stable than solving the linear system directly. I've gotten into the habit of reading equations like $y = X^{-1}z$ as "solve the system of equations $Xy = z$ for $y$."
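
To illustrate (a small made-up example, not from the original code): both forms return the same numbers, but the second never forms the inverse explicitly.

set.seed(1)
A <- crossprod(matrix(rnorm(25), 5, 5)) + diag(5)  # a small symmetric positive definite matrix
b <- rnorm(5)

w1 <- solve(A) %*% b   # explicit inverse, then multiply
w2 <- solve(A, b)      # solve the linear system directly
all.equal(drop(w1), w2)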

On a more mathematical note, you should not have to include an intercept term when fitting a ridge regression.

It is very important, when applying penalized methods, to standardize your data (as you point out with your comment about scale). It's also important to realize that penalties are generally not applied to the intercept term, as penalizing it would cause the model to violate the attractive property that the average prediction equals the average response (on the training data).

Together, these facts (centered data, no intercept penalty) imply that the intercept parameter estimate in a ridge regression is known a priori: it is zero.

The coefficient vector from ridge regression is the solution to the penalized optimization problem

$$ \hat\beta = \operatorname{argmin}_{\beta} \left( (y - X\beta)^t (y - X\beta) + \lambda \sum_{j > 0} \beta_j^2 \right) $$

Taking the partial derivative of this objective (call it $L$) with respect to the intercept parameter $\beta_0$:

$$ \frac{\partial L}{\partial \beta_0} = -2\sum_{i=1}^{n} \left( y_i - \sum_{j=0}^q \beta_j x_{ij} \right) x_{i0} $$

But the $x_{i0}$ are the entries in the model matrix corresponding to the intercept, so $x_{i0} = 1$ always. Dropping the constant factor $-2$, which does not affect where the derivative is zero, we get

$$\sum_{i=1}^n y_i - \sum_{j=0}^q \beta_j \sum_{i=1}^n x_{ij} $$

The first term, the sum over $y$, is zero because $y$ is centered (a good check of understanding is to work out what happens if you don't center $y$). In the second term, each predictor is centered, so the sum over $i$ is zero for every predictor $j$ except the intercept. For the intercept, that inner sum comes out to $n$ (it's $1 + 1 + 1 + \cdots$). So the whole expression reduces to

$$ -n \beta_0 $$

Setting this partial derivative equal to zero, $-n\beta_0 = 0$, we recover $\beta_0 = 0$, as expected.
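
As a quick numerical check of this (a sketch reusing X.prestige and y.prestige from the question): center both $X$ and $y$, penalize every coefficient except the intercept, and the estimated intercept comes out as zero up to floating point error.

Xc <- scale(X.prestige, center = TRUE, scale = FALSE)  # centered predictors
yc <- y.prestige - mean(y.prestige)                    # centered response

Xi <- cbind(1, Xc)          # model matrix with intercept column
P  <- 2 * diag(ncol(Xi))    # lambda = 2 on every coefficient ...
P[1, 1] <- 0                # ... except the intercept

solve(t(Xi) %*% Xi + P, t(Xi) %*% yc)   # first entry (the intercept) is ~0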

So, you do not need to bind an intercept column onto your model matrix. Your function should either expect standardized data (and if you plan on making it public, it should check that this is so) or standardize the data itself. Once this is done, the intercept is known to be zero. I'll leave it as an exercise to work out what the intercept should be when you translate the coefficients back to the un-normalized scale.

I still don't get results equivalent to lm.ridge though, but it might just be a question of translating the formula back into the original scales. However, I can't seem to work out how to do this. I thought it would just entail multiplying by the standard deviation of the response and adding the mean, as usual for standard scores, but either my function is still wrong or rescaling is more complex than I realize.

It's a bit more complicated, but not too bad if you are careful. Here's a place where I answered a very similar question:

GLMnet - “Unstandardizing” Linear Regression Coefficients

You may have to make a very simple change if you are not standardizing $y$.
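
For what it's worth, here is a minimal sketch of that back-transformation (my own illustration, not code from the linked answer; the name fitRidgeRescaled is made up). Fit on standardized $X$ and $y$, then rescale each coefficient by $s_y / s_{x_j}$ and recover the intercept from the means:

fitRidgeRescaled <- function(X, y, lambda) {
  X <- as.matrix(X)
  x.bar <- colMeans(X)
  x.sd  <- apply(X, 2, sd)
  y.bar <- mean(y)
  y.sd  <- sd(y)

  Z   <- scale(X)               # standardized predictors
  y.z <- (y - y.bar) / y.sd     # standardized response

  # Ridge solution on the standardized scale (intercept is zero there)
  gamma <- solve(t(Z) %*% Z + lambda * diag(ncol(Z)), t(Z) %*% y.z)

  # Back-transform: beta_j = gamma_j * sd(y) / sd(x_j); intercept from the means
  beta  <- drop(gamma) * y.sd / x.sd
  beta0 <- y.bar - sum(beta * x.bar)

  c(Intercept = beta0, beta)
}

Don't expect this to reproduce coef(lm.ridge(...)) digit for digit when $\lambda > 0$: lm.ridge applies its own internal scaling convention, so the effective penalty can differ slightly. The sketch is only meant to show the mechanics of undoing the standardization.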
