Solved – Minimum and maximum regularization in L0 (pseudo)norm penalized regression

Tags: glmnet, lasso, machine learning, regression

L0-pseudonorm penalized least squares regression (aka best subset regression) solves for $\widehat{\beta}(\lambda)$ via
$$\min_\beta \frac{1}{2}||y-X\beta||_2^2 +\lambda||\beta||_0,$$
where $||\beta||_0$ is the number of nonzero coefficients.
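In R, this objective can be written out directly; the helper name l0_objective below is just for illustration, not from any package:

# 0.5 * residual sum of squares + lambda * number of nonzero coefficients
l0_objective <- function(beta, X, y, lambda) {
  0.5 * sum((y - X %*% beta)^2) + lambda * sum(beta != 0)
}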
I was wondering what would be (1) the minimum value of $\lambda$ that results in no variables being selected and (2) the maximum value of $\lambda$ that results in the maximum number of variables being selected, either for the case where the coefficients are unconstrained or where they are constrained to be nonnegative (i.e. required to be all zero or positive, as in nnls). For LASSO regression, where we work with the L1-norm penalty $\lambda||\beta||_1$, I understand that (1) is given by $\lambda_1 = \max_j |X_j^Ty|$, but what would its value be in the case of L0-penalized regression (as implemented in the L0Learn package)?
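For reference, that LASSO threshold is a one-liner in R on the raw data (note that packages like glmnet standardize and divide by n, so their reported lambda scale differs):

# smallest LASSO lambda at which no variables are selected: max_j |X_j' y|
lambda1 <- max(abs(crossprod(X, y)))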

Example in R:

install.packages("L0Learn")
library(L0Learn)
# Simulate some data: n=500 observations, p=100 variables, k=10 true nonzero coefficients
data <- GenSynthetic(n=500, p=100, k=10, seed=1)
X <- data$X
y <- data$y
# Make an L0-penalized fit over a grid of 1000 lambda values:
system.time(L0fit <- L0Learn.fit(x=X, y=y, penalty="L0", algorithm="CDPSI", nLambda=1000, intercept=FALSE, maxSuppSize=100))

Maximum lambda that results in no variables being selected (the first value of L0Learn's path) = 0.0618124:

unlist(L0fit$lambda)[unlist(L0fit$suppSize)==0][1] # = 0.0618124

Lambda that results in the maximum number of variables being selected (100 here, i.e. all variables) = 6.5916e-09:

unlist(L0fit$lambda)[unlist(L0fit$suppSize)==max(unlist(L0fit$suppSize))][1] # = 6.5916e-09
max(unlist(L0fit$suppSize)) # size of largest model = 100
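To eyeball both endpoints of the path at once, the lambda and suppSize slots used above can be flattened into a data frame:

# one row per point on the path; the two values of interest sit at the ends
path <- data.frame(lambda = unlist(L0fit$lambda), suppSize = unlist(L0fit$suppSize))
head(path, 3) # largest lambdas: empty model
tail(path, 3) # smallest lambdas: largest support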

So I'm looking for a way to calculate those two lambda values (here 0.0618124 and 6.5916e-09) a priori.
For the 0.0618124 I tried the recipe in the answer below, but I couldn't quite reproduce that value; instead of 0.0618124 I am getting 677 in my example:

max(diag(1/crossprod(X, X)) * (crossprod(X, y)^2)) # max_j (X_j' y)^2 / (X_j' X_j) = 677.1252

This paper ("Efficient Regularized Regression with L0 Penalty for Variable Selection and Network Construction", Liu & Li, 2016, page 6) mentions a maximum $\lambda^\star = \max_{i = 1,\dots,p}~ (X^\top_i y)^2 / (4X^\top_i X_i)$, but again that seems to give a different value… Any thoughts?
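For what it's worth, that expression is also a one-liner on the raw data; the 4 in the denominator (rather than 2) presumably reflects a different scaling of the least-squares term in their objective:

# Liu & Li's threshold: max_i (X_i' y)^2 / (4 * X_i' X_i)
max(crossprod(X, y)^2 / (4 * colSums(X^2)))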

EDIT: So it seems that L0Learn first centers and then L2-norm normalizes both the columns of the design matrix and the outcome variable y. Hence, based on the logic in the answers below, the maximum lambda in L0Learn (above which all coefficients are set to zero) is given by

# center each column and scale it to unit L2 norm
# (sqrt(sum(.^2)) avoids base norm(), which expects a matrix rather than a vector)
Xcentnorm <- apply(X, 2, function(col) (col - mean(col)) / sqrt(sum((col - mean(col))^2)))
ycentnorm <- (y - mean(y)) / sqrt(sum((y - mean(y))^2))
max(crossprod(Xcentnorm, ycentnorm)^2 / 2) # = 0.06262011

The factor diag(1/crossprod(Xcentnorm, Xcentnorm)) drops out because of the L2 norm normalization (i.e. it would be a vector of 1s).
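You can verify that directly:

colSums(Xcentnorm^2) # all equal to 1, so diag(1/crossprod(Xcentnorm)) would be a vector of 1s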

Best Answer

Assuming the columns have unit L2 norm, the $\lambda^{*}$ which sets all coefficients to zero is given by $\frac{1}{2} \max_{j} (X_j^T y)^2$ (the reasoning in the answer above is correct, but the final answer misses the factor of $\frac{1}{2}$).
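To see where the factor of $\frac{1}{2}$ comes from, consider a single unit-norm column $X_j$. Restricted to coordinate $j$, the objective is
$$\frac{1}{2}||y - X_j\beta_j||_2^2 + \lambda\,1[\beta_j \neq 0] = \frac{1}{2}||y||_2^2 - \beta_j X_j^T y + \frac{1}{2}\beta_j^2 + \lambda\,1[\beta_j \neq 0],$$
which over nonzero $\beta_j$ is minimized at $\beta_j = X_j^T y$, beating $\beta_j = 0$ by $\frac{1}{2}(X_j^T y)^2 - \lambda$. So no coordinate can profitably move away from zero once $\lambda \geq \frac{1}{2}\max_j (X_j^T y)^2$, which is exactly the thresholding rule behind $\lambda^{*}$.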

L0Learn centers and then normalizes the columns before fitting the model. The $\lambda$'s are reported after centering and normalization. So to reproduce L0Learn's $\lambda^{*}$ you can try centering and then normalizing the columns.
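As a quick sanity check against the fit from the question (how closely they match depends on how L0Learn constructs its lambda grid, so the first path value may sit slightly below the theoretical threshold):

# theoretical threshold on the centered & normalized data vs. the first lambda of the fitted path
max(crossprod(Xcentnorm, ycentnorm)^2 / 2) # 0.06262011
unlist(L0fit$lambda)[1] # 0.0618124 in the example above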
