Solved – Minimum and maximum regularization in L0 (pseudo)norm penalized regression

Tags: glmnet, lasso, machine learning, regression

L0-pseudonorm penalized least squares regression (aka best subset regression) solves for $\widehat{\beta}(\lambda)$ via
$$\min_\beta \frac{1}{2}||y-X\beta||_2^2 +\lambda||\beta||_0,$$
where $||\beta||_0$ is the number of nonzero coefficients.
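In R, this objective can be written out directly; the helper name l0_objective below is just for illustration, not from any package:

# 0.5 * residual sum of squares + lambda * number of nonzero coefficients
l0_objective <- function(beta, X, y, lambda) {
  0.5 * sum((y - X %*% beta)^2) + lambda * sum(beta != 0)
}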
I was wondering what would be (1) the minimum value of $\lambda$ that results in no variables being selected and (2) the maximum value of $\lambda$ that results in the maximum number of variables being selected, either for the case where the coefficients are unconstrained or where they are constrained to be nonnegative (i.e. required to be all zero or positive, as in nnls). For LASSO regression, where we work with the L1-norm penalty $\lambda||\beta||_1$, I understand that (1) is given by $\lambda_1 = \max_j |X_j^Ty|$, but what would its value be in the case of L0-penalized regression (as implemented in the L0Learn package)?
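For reference, that LASSO threshold is a one-liner in R on the raw data (note that packages like glmnet standardize and divide by n, so their reported lambda scale differs):

# smallest LASSO lambda at which no variables are selected: max_j |X_j' y|
lambda1 <- max(abs(crossprod(X, y)))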

Example in R:

install.packages("L0Learn")
library(L0Learn)
# Simulate some data: n=500 observations, p=100 variables, k=10 true nonzero coefficients
data <- GenSynthetic(n=500, p=100, k=10, seed=1)
X <- data$X
y <- data$y
# Make an L0-penalized fit over a grid of 1000 lambda values:
system.time(L0fit <- L0Learn.fit(x=X, y=y, penalty="L0", algorithm="CDPSI", nLambda=1000, intercept=FALSE, maxSuppSize=100))

Maximum lambda that results in no variables being selected (the first value of L0Learn's path) = 0.0618124:

unlist(L0fit$lambda)[unlist(L0fit$suppSize)==0][1] # = 0.0618124

Lambda that results in the maximum number of variables being selected (100 here, i.e. all variables) = 6.5916e-09:

unlist(L0fit$lambda)[unlist(L0fit$suppSize)==max(unlist(L0fit$suppSize))][1] # = 6.5916e-09
max(unlist(L0fit$suppSize)) # size of largest model = 100
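To eyeball both endpoints of the path at once, the lambda and suppSize slots used above can be flattened into a data frame:

# one row per point on the path; the two values of interest sit at the ends
path <- data.frame(lambda = unlist(L0fit$lambda), suppSize = unlist(L0fit$suppSize))
head(path, 3) # largest lambdas: empty model
tail(path, 3) # smallest lambdas: largest support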

So I'm looking for a way to calculate those two lambda values (here 0.0618124 and 6.5916e-09) a priori.
For the 0.0618124 I tried the recipe in the answer below, but I couldn't quite reproduce that value; instead of 0.0618124 I am getting 677 in my example:

max(diag(1/crossprod(X, X)) * (crossprod(X, y)^2)) # max_j (X_j' y)^2 / (X_j' X_j) = 677.1252

This paper ("Efficient Regularized Regression with L0 Penalty for Variable Selection and Network Construction", Liu & Li, 2016, page 6) mentions a maximum $\lambda^\star = \max_{i = 1,\dots,p}~ (X^\top_i y)^2 / (4X^\top_i X_i)$, but again that seems to give a different value… Any thoughts?
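For what it's worth, that expression is also a one-liner on the raw data; the 4 in the denominator (rather than 2) presumably reflects a different scaling of the least-squares term in their objective:

# Liu & Li's threshold: max_i (X_i' y)^2 / (4 * X_i' X_i)
max(crossprod(X, y)^2 / (4 * colSums(X^2)))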

EDIT: So it seems that L0Learn first centers and then L2-norm normalizes both the columns of the design matrix and the outcome variable y. Hence, based on the logic in the answers below, the maximum lambda in L0Learn (above which all coefficients are set to zero) is given by

# center each column and scale it to unit L2 norm
# (sqrt(sum(.^2)) avoids base norm(), which expects a matrix rather than a vector)
Xcentnorm <- apply(X, 2, function(col) (col - mean(col)) / sqrt(sum((col - mean(col))^2)))
ycentnorm <- (y - mean(y)) / sqrt(sum((y - mean(y))^2))
max(crossprod(Xcentnorm, ycentnorm)^2 / 2) # = 0.06262011

The factor diag(1/crossprod(Xcentnorm, Xcentnorm)) drops out because of the L2 norm normalization (i.e. it would be a vector of 1s).
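You can verify that directly:

colSums(Xcentnorm^2) # all equal to 1, so diag(1/crossprod(Xcentnorm)) would be a vector of 1s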

Best Answer

Assuming the columns have unit L2 norm, the $\lambda^{*}$ which sets all coefficients to zero is given by $\frac{1}{2} \max_{j} (X_j^T y)^2$ (the reasoning in the answer above is correct, but the final answer misses the factor of $\frac{1}{2}$).
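To see where the factor of $\frac{1}{2}$ comes from, consider a single unit-norm column $X_j$. Restricted to coordinate $j$, the objective is
$$\frac{1}{2}||y - X_j\beta_j||_2^2 + \lambda\,1[\beta_j \neq 0] = \frac{1}{2}||y||_2^2 - \beta_j X_j^T y + \frac{1}{2}\beta_j^2 + \lambda\,1[\beta_j \neq 0],$$
which over nonzero $\beta_j$ is minimized at $\beta_j = X_j^T y$, beating $\beta_j = 0$ by $\frac{1}{2}(X_j^T y)^2 - \lambda$. So no coordinate can profitably move away from zero once $\lambda \geq \frac{1}{2}\max_j (X_j^T y)^2$, which is exactly the thresholding rule behind $\lambda^{*}$.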

L0Learn centers and then normalizes the columns before fitting the model. The $\lambda$'s are reported after centering and normalization. So to reproduce L0Learn's $\lambda^{*}$ you can try centering and then normalizing the columns.
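As a quick sanity check against the fit from the question (how closely they match depends on how L0Learn constructs its lambda grid, so the first path value may sit slightly below the theoretical threshold):

# theoretical threshold on the centered & normalized data vs. the first lambda of the fitted path
max(crossprod(Xcentnorm, ycentnorm)^2 / 2) # 0.06262011
unlist(L0fit$lambda)[1] # 0.0618124 in the example above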
