Solved – How to perform non-negative ridge regression

lasso, regression, regularization, ridge regression

How to perform non-negative ridge regression? Non-negative lasso is available in scikit-learn, but for ridge, I cannot enforce non-negativity of betas, and indeed, I am getting negative coefficients. Does anyone know why this is?

Also, can I implement ridge in terms of regular least squares? Moved this to another question: Can I implement ridge regression in terms of OLS regression?

Best Answer

The rather anti-climactic answer to "Does anyone know why this is?" is simply that nobody has cared enough to implement a non-negative ridge regression routine. One of the main reasons is that people have already started implementing non-negative elastic net routines (for example here and here). Elastic net includes ridge regression as a special case (one essentially sets the LASSO part to a zero weighting). These works are relatively new, so they have not yet been incorporated into scikit-learn or a similar general-use package. You might want to ask the authors of these papers for code.

EDIT:

As @amoeba and I discussed in the comments, the actual implementation is relatively simple. Say one has the following regression problem:

$y = 2 x_1 - x_2 + \epsilon, \qquad \epsilon \sim N(0,0.2^2)$

where $x_1$ and $x_2$ are both standard normal, i.e. $x_p \sim N(0,1)$. Notice I use standardised predictor variables so I do not have to normalise afterwards. For simplicity I do not include an intercept either. We can immediately solve this regression problem using standard linear regression; in R it is something like this:

rm(list = ls())
library(MASS)
set.seed(123)
N = 1e6
x1 = rnorm(N)                               # standard normal predictors
x2 = rnorm(N)
y = 2 * x1 - 1 * x2 + rnorm(N, sd = 0.2)    # true coefficients: 2 and -1

simpleLR = lm(y ~ -1 + x1 + x2)             # plain OLS without an intercept
matrixX = model.matrix(simpleLR)            # this is close to standardised
vectorY = y
all.equal(coef(simpleLR), qr.solve(matrixX, vectorY), tolerance = 1e-7)  # TRUE

Notice the last line. Almost all linear regression routines use the QR decomposition to estimate $\beta$. We would like to use the same machinery for our ridge regression problem. At this point read this post by @whuber; we will be implementing exactly this procedure. In short, we augment our original design matrix $X$ with a $\sqrt{\lambda}I_p$ diagonal matrix and our response vector $y$ with $p$ zeros. That way we can re-express the original ridge regression solution $(X^TX + \lambda I)^{-1} X^Ty$ as $(\bar{X}^T\bar{X})^{-1} \bar{X}^T\bar{y}$, where the bar symbolises the augmented version. Check slides 18-19 of these notes too for completeness; I found them quite straightforward. So in R we would do something like the following:

myLambda = 100
simpleRR = lm.ridge(y ~ -1 + x1 + x2, lambda = myLambda)
newVecY = c(vectorY, rep(0, 2))                       # augment y with p = 2 zeros
newMatX = rbind(matrixX, sqrt(myLambda) * diag(2))    # augment X with sqrt(lambda) * I_p
all.equal(coef(simpleRR), qr.solve(newMatX, newVecY), tolerance = 1e-7)  # TRUE
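
For completeness, the augmented QR solution can also be checked directly against the closed-form ridge estimate $(X^TX + \lambda I)^{-1} X^Ty$ quoted above; a quick sanity check using base R's crossprod() and solve():

closedFormRidge = solve(crossprod(matrixX) + myLambda * diag(2),
                        crossprod(matrixX, vectorY))      # (X'X + lambda*I)^{-1} X'y
all.equal(drop(closedFormRidge), qr.solve(newMatX, newVecY),
          check.attributes = FALSE, tolerance = 1e-7)  # TRUE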

and it works. OK, so we have the ridge regression part. We could solve it in another way though: formulate it as an optimisation problem where the residual sum of squares is the cost function and then minimise it, i.e. $ \displaystyle \min_{\beta} || \bar{y} - \bar{X}\beta||_2^2$. Sure enough, we can do that:

# cost function: residual sum of squares of the augmented problem
myRSS <- function(X, y, b){ return( sum( (y - X %*% b)^2 ) ) }

# optim() passes the parameter vector as the first unmatched argument of the
# cost function, so par = c(1, 1) ends up in b here
bfgsOptim = optim(myRSS, par = c(1, 1), X = newMatX, y = newVecY,
                  method = 'L-BFGS-B')
all.equal(coef(simpleRR), bfgsOptim$par, check.attributes = FALSE,
          tolerance = 1e-7)  # TRUE

which, as expected, works again. So now we just want $ \displaystyle \min_{\beta} || \bar{y} - \bar{X}\beta||_2^2$ subject to $\beta \geq 0$, which is simply the same optimisation problem but constrained so that the solution is non-negative.

bfgsOptimConst = optim(myRSS, par = c(1, 1), X = newMatX, y = newVecY,
                       method = 'L-BFGS-B', lower = c(0, 0))  # box constraint: beta >= 0
all(bfgsOptimConst$par >= 0)  # TRUE
(bfgsOptimConst$par)  # 2.000504 0.000000

which shows that the original non-negative ridge regression task can be solved by reformulating it as a simple constrained optimisation problem. Some caveats:

  1. I used (practically) normalised predictor variables. You will need to account for the normalisation yourself.
  2. The same goes for the intercept, which is not normalised (and which I omitted here); you will need to handle it yourself.
  3. I used optim's L-BFGS-B method. It is the most vanilla R solver that accepts bounds. I am sure you will find dozens of better solvers.
  4. In general, constrained linear least-squares problems are posed as quadratic programming tasks. That is overkill for this post, but keep in mind that you can get better speed from a dedicated QP solver if needed (see the sketch right after this list).
  5. As mentioned in the comments, you could skip the ridge-regression-as-augmented-linear-regression part and directly encode the ridge cost function as an optimisation problem. That would be a lot simpler and would make this post significantly shorter. For the sake of argument I append this second solution too.
  6. I am not fully conversant in Python, but essentially you can replicate this work using NumPy's linalg.solve and SciPy's optimize functions.
  7. To pick the hyperparameter $\lambda$ you just do the usual CV step you would do in any case; nothing changes (a minimal sketch is appended at the very end).
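
Code for point 4, a rough sketch assuming the quadprog package: solve.QP() minimises $\frac{1}{2}b^TDb - d^Tb$ subject to $A^Tb \geq b_0$, so the ridge objective translates to $D = 2(X^TX + \lambda I)$ and $d = 2X^Ty$, with the identity matrix encoding $b \geq 0$.

library(quadprog)
Dmat = 2 * (crossprod(matrixX) + myLambda * diag(2))   # quadratic term of the ridge cost
dvec = drop(2 * crossprod(matrixX, vectorY))           # linear term
Amat = diag(2)                                         # constraints: b1 >= 0, b2 >= 0
bvec = c(0, 0)
qpSol = solve.QP(Dmat, dvec, Amat, bvec)
qpSol$solution  # should closely match bfgsOptimConst$par from above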

Code for point 5:

# ridge cost encoded directly: RSS + lambda * ||b||^2, on the original (unaugmented) data
myRidgeRSS <- function(X, y, b, lambda){
                return( sum( (y - X %*% b)^2 ) + lambda * sum(b^2) )
              }
bfgsOptimConst2 = optim(myRidgeRSS, par = c(1, 1), X = matrixX, y = vectorY,
                        method = 'L-BFGS-B', lower = c(0, 0), lambda = myLambda)
all(bfgsOptimConst2$par >= 0)  # TRUE
(bfgsOptimConst2$par)  # 2.000504 0.000000
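
Code for point 7, a minimal sketch of how the CV step could look: 5-fold CV over a small $\lambda$ grid, reusing myRidgeRSS and the constrained optim() call from above (the grid, fold count and error metric are illustrative choices only, and this is slow-ish at N = 1e6).

lambdaGrid = c(0.1, 1, 10, 100, 1000)
K = 5
folds = sample(rep(1:K, length.out = N))     # random fold assignment
cvErr = sapply(lambdaGrid, function(lam) {
  mean(sapply(1:K, function(k) {
    trainIdx = folds != k
    fit = optim(myRidgeRSS, par = c(1, 1),
                X = matrixX[trainIdx, ], y = vectorY[trainIdx],
                method = 'L-BFGS-B', lower = c(0, 0), lambda = lam)
    mean((vectorY[!trainIdx] - matrixX[!trainIdx, ] %*% fit$par)^2)  # held-out MSE
  }))
})
bestLambda = lambdaGrid[which.min(cvErr)]    # whichever lambda minimises the CV error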