Solved – How to Apply the Iteratively Reweighted Least Squares (IRLS) Method to the LASSO Model

Tags: convex, feature selection, generalized linear model, lasso, logistic

I have programmed a logistic regression using the IRLS algorithm. I would like to apply a LASSO penalization in order to automatically select the right features. At each iteration, the following is solved:

$$\left(X^{T}WX\right)\delta\hat\beta = X^{T}\left(y-p\right)$$
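For reference, here is a minimal sketch of one unpenalized IRLS step in R, assuming a design matrix whose first column is the intercept; the function and variable names are illustrative, not from the original post:

```r
# One unpenalized IRLS step for logistic regression.
# X: n x p design matrix (first column = intercept), y: 0/1 response vector,
# beta: current coefficient vector. All names are illustrative.
irls_step <- function(X, y, beta) {
  eta <- as.vector(X %*% beta)   # linear predictor
  p   <- 1 / (1 + exp(-eta))     # fitted probabilities
  W   <- diag(p * (1 - p))       # IRLS weight matrix
  # Solve (X' W X) delta = X' (y - p) for the Newton step delta
  delta <- solve(t(X) %*% W %*% X, t(X) %*% (y - p))
  as.vector(beta + delta)
}
```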

Let $\lambda$ be a non-negative real number. I am not penalizing the intercept, as suggested in The Elements of Statistical Learning, nor the coefficients that are already zero. Otherwise, I subtract a term from the right-hand side:

$$X^{T}\left(y-p\right) - \lambda\,\mathrm{sign}\left(\hat\beta\right)$$

However, I am unsure whether this is a valid modification of the IRLS algorithm. Is this the right way to do it?
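For concreteness, here is a sketch of the modified right-hand side the question describes; this illustrates the proposal above, not a recommended method, and the function name is hypothetical:

```r
# Right-hand side with the subgradient penalty described above: subtract
# lambda * sign(beta_j), except for the intercept (component 1) and for
# coefficients that are already zero. Sketch of the question's proposal.
penalized_rhs <- function(X, y, p, beta, lambda) {
  pen <- lambda * sign(beta)
  pen[1] <- 0            # do not penalize the intercept
  pen[beta == 0] <- 0    # nor the already-zero coefficients
  as.vector(t(X) %*% (y - p)) - pen
}
```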


Edit: Although I was not confident about it, here is one of the solutions I finally came up with. Interestingly, this solution corresponds to what I now understand about the LASSO: there are two steps at each iteration instead of merely one:

  • the first step is the same as before: we perform one iteration of the algorithm (as if $\lambda=0$ in the formula for the gradient above),
  • the second step is the new one: we apply soft-thresholding to each component of the vector $\beta$ obtained in the first step (except for the component $\beta_0$, which corresponds to the intercept). This is known as the Iterative Soft-Thresholding Algorithm (ISTA); a short sketch follows the formula below.

$$\forall i \geq 1, \beta_{i}\leftarrow\mathrm{sign}\left(\beta_{i}\right)\times\max\left(0,\,\left|\beta_{i}\right|-\lambda\right)$$
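A minimal sketch of the resulting two-step iteration, reusing the hypothetical `irls_step` from above; the threshold is applied to every component except the intercept:

```r
# One ISTA-style iteration: a plain IRLS update (step 1) followed by
# soft-thresholding of every coefficient except the intercept (step 2).
lasso_irls_step <- function(X, y, beta, lambda) {
  beta <- irls_step(X, y, beta)   # step 1: update as if lambda = 0
  j <- seq_along(beta)[-1]        # step 2: skip beta_0 (the intercept)
  beta[j] <- sign(beta[j]) * pmax(0, abs(beta[j]) - lambda)
  beta
}
```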

Best Answer

This problem is typically solved by coordinate descent (see here). That method is numerically safer and more efficient, algorithmically easier to implement, and applicable to a more general array of models (including Cox regression). An implementation is available in the R package glmnet. The code is open source (partly in FORTRAN, partly in R), so you can use it as a blueprint.
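For instance, a logistic LASSO fit with glmnet looks like this, where `x` and `y` are placeholder data; `alpha = 1` selects the LASSO penalty, and cross-validation picks $\lambda$:

```r
library(glmnet)

# x: numeric n x p predictor matrix, y: binary 0/1 response (placeholders).
# alpha = 1 requests the pure LASSO penalty; the intercept is not penalized.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.min")   # coefficients at the CV-optimal lambda
```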