Solved – Naive Ridge Regression in R

Tags: r, regression, ridge-regression

I'm trying to learn some basic Machine Learning and some basic R. I have made a very naive implementation of $L_2$ regularization in R based on the formula:

$\hat w^{ridge} = (X^TX +\lambda I)^{-1} X^T y$

My code looks like this:

fitRidge <- function(X, y, lambda) {
  # Add intercept column to X:
  X <- cbind(1, X)
  # Calculate penalty matrix:
  lambda.diag <- lambda * diag(dim(X)[2])
  # Apply formula for ridge regression:
  return(solve(t(X) %*% X + lambda.diag) %*% t(X) %*% y)
}

Note that I'm not yet trying to find an optimal $\lambda$; I'm simply estimating $\hat w^{ridge}$ for a given $\lambda$. However, something seems off. When I enter $\lambda = 0$, I get the expected OLS result. I checked this by applying lm.ridge(lambda = 0) to the same dataset, and it gives me the same coefficients. However, when I input any other penalty, like $\lambda=2$ or $\lambda=5$, my coefficients and the coefficients given by lm.ridge disagree wildly. I tried looking at the implementation of lm.ridge but I couldn't work out what it does (and therefore what it does differently).

Could anyone explain why there is a difference between my results and the results from lm.ridge? Am I doing something wrong in my code? I've tried playing around with scale() but couldn't find an answer there.

EDIT:

To see what happens, run the following:

library(car)   # for the Prestige data
library(MASS)  # for lm.ridge
X.prestige <- as.matrix.data.frame(Prestige[,c(1,2,3,5)])
y.prestige <- Prestige[,4]

fitRidge(X.prestige, y.prestige, 0)
coef(lm.ridge(formula = prestige~education+income+women+census, data = Prestige, lambda = 0))
fitRidge(X.prestige, y.prestige, 2)
coef(lm.ridge(formula = prestige~education+income+women+census, data = Prestige, lambda = 2))

EDIT2:

Okay, so based on the responses below, I've gotten a somewhat clearer understanding of the problem. I've also closely re-read the section about RR in The Elements of Statistical Learning by Hastie, Tibshirani and Friedman, where I discovered that the intercept is often estimated simply as the mean of the response. It seems that many online sources on RR are overly vague. I suspect many writers have never implemented RR themselves and have missed some important points, since many of them leave out 3 important facts:

  1. The intercept is not penalized in the normal case; the formula above only applies to the other coefficients.
  2. RR is not equivariant under scaling, i.e. different scalings of the predictors give different results even for the same data (see the quick check after this list).
  3. Following from 1, how one actually estimates the intercept.
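
For instance, here is a quick check of point 2, reusing the original fitRidge and the Prestige objects from the first edit (the division by 1e6 is deliberately extreme, just to make the effect easy to see):

# Same data, but with income expressed in a (deliberately extreme) smaller unit:
X.rescaled <- X.prestige
X.rescaled[, "income"] <- X.rescaled[, "income"] / 1e6

fitRidge(X.prestige, y.prestige, 2)   # income's raw coefficient is tiny, so it is barely penalized
fitRidge(X.rescaled, y.prestige, 2)   # same lambda now shrinks income hard,
                                      # and the other coefficients shift too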

I tried altering my function accordingly:

fitRidge <- function(X, Y, lambda) {
  # Standardize X and Y
  X <- scale(X)
  Y <- scale(Y)
  # Generate penalty matrix
  penalties <- lambda * diag(ncol(X))
  # Estimate intercept
  inter <- mean(Y)
  # Solve ridge system
  coeff <- solve(t(X) %*% X + penalties, t(X) %*% Y)
  # Create standardized weight vector
  wz <- c(inter, coeff)
  return(wz)
}

I still don't get results equivalent to lm.ridge though, but it might just be a question of translating the formula back into the original scales. However, I can't seem to work out how to do this. I thought it would just entail multiplying by the standard deviation of the response and adding the mean, as usual for standard scores, but either my function is still wrong or rescaling is more complex than I realize.

Any advice?

Best Answer

First, very simply, I don't think your call to solve looks right; this is what I would expect:

solve(t(X) %*% X + lambda.diag, t(X) %*% y)

Your code explicitly calculates a matrix inverse and then multiplies by it. This is mathematically correct, but computationally it is the wrong approach: forming the inverse is more work and less numerically stable than solving the linear system directly. I've gotten into the habit of reading equations like $y = X^{-1}z$ as "solve the system of equations $Xy = z$ for $y$."
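
To illustrate (a small made-up example, not from the original code): both forms return the same numbers, but the second never forms the inverse explicitly.

set.seed(1)
A <- crossprod(matrix(rnorm(25), 5, 5)) + diag(5)  # a small symmetric positive definite matrix
b <- rnorm(5)

w1 <- solve(A) %*% b   # explicit inverse, then multiply
w2 <- solve(A, b)      # solve the linear system directly
all.equal(drop(w1), w2)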

On a more mathematical note, you should not have to include an intercept term when fitting a ridge regression.

It is very important, when applying penalized methods, to standardize your data (as you point out with your comment about scale). It's also important to realize that penalties are generally not applied to the intercept term, as penalizing it would cause the model to violate the attractive property that the average prediction equals the average response (on the training data).

Together, these facts (centered data, no intercept penalty) imply that the intercept parameter estimate in a ridge regression is known a priori: it is zero.

The coefficient vector from ridge regression is the solution to the penalized optimization problem

$$ \hat\beta = \operatorname{argmin}_{\beta} \left( (y - X\beta)^t (y - X\beta) + \lambda \sum_{j > 0} \beta_j^2 \right) $$

Taking the partial derivative of this objective (call it $L$) with respect to the intercept parameter $\beta_0$:

$$ \frac{\partial L}{\partial \beta_0} = -2\sum_{i=1}^{n} \left( y_i - \sum_{j=0}^q \beta_j x_{ij} \right) x_{i0} $$

But the $x_{i0}$ are the entries in the model matrix corresponding to the intercept, so $x_{i0} = 1$ always. Dropping the constant factor $-2$, which does not affect where the derivative is zero, we get

$$\sum_{i=1}^n y_i - \sum_{j=0}^q \beta_j \sum_{i=1}^n x_{ij} $$

The first term, the sum over $y$, is zero because $y$ is centered (a good check of understanding is to work out what happens if you don't center $y$). In the second term, each predictor is centered, so the sum over $i$ is zero for every predictor $j$ except the intercept. For the intercept, that inner sum comes out to $n$ (it's $1 + 1 + 1 + \cdots$). So the whole expression reduces to

$$ -n \beta_0 $$

Setting this partial derivative equal to zero, $-n\beta_0 = 0$, we recover $\beta_0 = 0$, as expected.
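
As a quick numerical check of this (a sketch reusing X.prestige and y.prestige from the question): center both $X$ and $y$, penalize every coefficient except the intercept, and the estimated intercept comes out as zero up to floating point error.

Xc <- scale(X.prestige, center = TRUE, scale = FALSE)  # centered predictors
yc <- y.prestige - mean(y.prestige)                    # centered response

Xi <- cbind(1, Xc)          # model matrix with intercept column
P  <- 2 * diag(ncol(Xi))    # lambda = 2 on every coefficient ...
P[1, 1] <- 0                # ... except the intercept

solve(t(Xi) %*% Xi + P, t(Xi) %*% yc)   # first entry (the intercept) is ~0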

So, you do not need to bind an intercept column onto your model matrix. Your function should either expect standardized data (and if you plan on making it public, it should check that this is so) or standardize the data itself. Once this is done, the intercept is known to be zero. I'll leave it as an exercise to work out what the intercept should be when you translate the coefficients back to the un-normalized scale.

I still don't get results equivalent to lm.ridge though, but it might just be a question of translating the formula back into the original scales. However, I can't seem to work out how to do this. I thought it would just entail multiplying by the standard deviation of the response and adding the mean, as usual for standard scores, but either my function is still wrong or rescaling is more complex than I realize.

It's a bit more complicated, but not too bad if you are careful. Here's a place where I answered a very similar question:

GLMnet - “Unstandardizing” Linear Regression Coefficients

You may have to make a very simple change if you are not standardizing $y$.
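
For what it's worth, here is a minimal sketch of that back-transformation (my own illustration, not code from the linked answer; the name fitRidgeRescaled is made up). Fit on standardized $X$ and $y$, then rescale each coefficient by $s_y / s_{x_j}$ and recover the intercept from the means:

fitRidgeRescaled <- function(X, y, lambda) {
  X <- as.matrix(X)
  x.bar <- colMeans(X)
  x.sd  <- apply(X, 2, sd)
  y.bar <- mean(y)
  y.sd  <- sd(y)

  Z   <- scale(X)               # standardized predictors
  y.z <- (y - y.bar) / y.sd     # standardized response

  # Ridge solution on the standardized scale (intercept is zero there)
  gamma <- solve(t(Z) %*% Z + lambda * diag(ncol(Z)), t(Z) %*% y.z)

  # Back-transform: beta_j = gamma_j * sd(y) / sd(x_j); intercept from the means
  beta  <- drop(gamma) * y.sd / x.sd
  beta0 <- y.bar - sum(beta * x.bar)

  c(Intercept = beta0, beta)
}

Don't expect this to reproduce coef(lm.ridge(...)) digit for digit when $\lambda > 0$: lm.ridge applies its own internal scaling convention, so the effective penalty can differ slightly. The sketch is only meant to show the mechanics of undoing the standardization.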
