Solved – How to estimate shrinkage parameter in Lasso or ridge regression with >50K variables

Tags: high-dimensional, lasso, r, ridge-regression

I want to use lasso or ridge regression for a model with more than 50,000 variables, and I want to do so using a software package in R. How can I estimate the shrinkage parameter ($\lambda$)?

Edit:

Here is how far I have got:

set.seed(123)
Y  <- runif(1000)                                   # simulated response
Xv <- sample(c(1, 0), size = 1000 * 1000, replace = TRUE)
X  <- matrix(Xv, nrow = 1000, ncol = 1000)          # 1000 binary predictors

mydf <- data.frame(Y, X)

require(MASS)
lm.ridge(Y ~ ., mydf)                               # ridge fit with the default lambda = 0

plot(lm.ridge(Y ~ ., mydf,
              lambda = seq(0, 0.1, 0.001)))         # ridge trace over a grid of lambda values

[Figure: ridge trace plot of the coefficients against $\lambda$, produced by the code above]

My question is: How do I know which $\lambda$ is best for my model?

Best Answer

The function cv.glmnet from the R package glmnet performs automatic cross-validation over a grid of $\lambda$ values for $\ell_1$-penalized regression problems, in particular for the lasso. The package also supports the more general elastic net penalty, which is a combination of $\ell_1$ and $\ell_2$ penalization. As of version 1.7.3 of the package, setting the $\alpha$ parameter to 0 gives ridge regression (at least, this functionality was not documented until recently).
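To make this concrete, here is a minimal sketch (not part of the original answer) that fits a ridge path with glmnet on simulated data shaped like the question's example; the data-generating lines are an assumption for illustration:

library(glmnet)

set.seed(123)
n <- 1000; p <- 1000
X <- matrix(rbinom(n * p, 1, 0.5), nrow = n, ncol = p)  # binary predictors, as in the question
Y <- runif(n)                                           # simulated response

fit <- glmnet(X, Y, alpha = 0)   # alpha = 0: ridge penalty; the whole lambda path is fitted at once
plot(fit, xvar = "lambda")       # coefficient paths plotted against log(lambda)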

Cross-validation yields an estimate of the expected generalization error for each $\lambda$, and $\lambda$ can sensibly be chosen as the minimizer of this estimate. The cv.glmnet function returns two values of $\lambda$: the minimizer, lambda.min, and the always larger lambda.1se, a heuristic choice of $\lambda$ that produces a less complex model whose estimated expected generalization error is within one standard error of the minimum. Different loss functions for measuring the generalization error are available in the glmnet package; the argument type.measure specifies the loss function.
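As a sketch of the selection step, reusing the simulated X and Y from the block above (only cv.glmnet, type.measure, lambda.min and lambda.1se are the package's own names; the rest is illustrative):

cvfit <- cv.glmnet(X, Y, alpha = 1, type.measure = "mse")  # 10-fold CV by default; alpha = 1 is the lasso

cvfit$lambda.min                 # lambda minimizing the estimated generalization error
cvfit$lambda.1se                 # larger lambda within one standard error of the minimum

coef(cvfit, s = "lambda.min")    # coefficients at the chosen lambda (sparse for the lasso)
plot(cvfit)                      # CV error curve with both lambda values marked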

Alternatively, the R package mgcv offers extensive possibilities for estimation with quadratic penalization, including automatic selection of the penalty parameters. The implemented methods include generalized cross-validation (GCV) and REML, as mentioned in a comment. More details can be found in the package author's book: Wood, S. N. (2006), Generalized Additive Models: An Introduction with R, Chapman & Hall/CRC.
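For completeness, a small toy sketch of automatic penalty selection in mgcv (my own simulated one-predictor example, not from the answer or the book):

library(mgcv)

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)   # simulated smooth signal plus noise

fit <- gam(y ~ s(x), method = "REML")  # method = "GCV.Cp" selects by GCV instead
fit$sp                                 # the automatically selected smoothing (penalty) parameter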