Solved – Why is lasso in MATLAB much slower than glmnet in R (10 min versus ~1 s)

feature-selection, regression, regularization

I have observed that the function lasso in MATLAB is relatively slow. I run many regression problems, typically with 1 to 100 predictors and 200 to 500 observations. In some cases lasso turned out to be extremely slow: solving a single regression problem took several minutes. I discovered that this was the case when the predictors were highly correlated (e.g., air temperature time series at neighboring grid points of an atmospheric model).

I compared the performance of the example below in MATLAB and in R.

y is the predictand vector with 163 elements (the observations) and x is the predictor matrix with 163 rows, corresponding to the observations in y, and 100 columns (the predictors). I applied the MATLAB function lasso as follows:

[beta_L,stats]=lasso(x,y,'cv',4);

The same in R, using glmnet:

fit.lasso=cv.glmnet(predictor.ts,predictand.ts,nfolds=4)

Both the MATLAB and the R implementations are based on a coordinate descent algorithm. The default number of lambda values is 100 for both lasso and glmnet. The default convergence threshold for the coordinate descent is 10^-4 in MATLAB, and even stricter in R (10^-7).
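For reference, the glmnet side of those defaults can be written out explicitly in the call from above (just a sketch reusing the variable names introduced earlier; nlambda and thresh are forwarded by cv.glmnet to glmnet):

# Same fit as before, with the default path length and convergence
# threshold made explicit (both are passed through to glmnet):
fit.lasso=cv.glmnet(predictor.ts, predictand.ts, nfolds=4,
                    nlambda=100, thresh=1e-7)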

The R function takes about one second on my computer. MATLAB takes several minutes, with most of the computation time spent in the coordinate descent algorithm.

When the predictors are less correlated (e.g., different variable types of a numerical atmospheric model), lasso in MATLAB is not as slow, but it still takes ~30 s, compared to ~1 s in R.

Is MATLAB's lasso really that much less efficient than glmnet, or am I missing something?

Best Answer

glmnet in R is fast because it uses what are called regularization paths. Basically, you select an ordered (decreasing) sequence of penalization parameters $\lambda_1, \ldots, \lambda_m$. The solution for $\lambda_1$ is used as a warm start for $\lambda_2$, the solution for $\lambda_2$ as a warm start for $\lambda_3$, and so on, because consecutive solutions should be close to one another. So when fitting the model for the $(n+1)$th penalization parameter, you don't start the coordinate descent from an arbitrary point in the parameter space. Instead you start from somewhere that is already close to the solution: the parameters of the $n$th model.
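To make the warm-start idea concrete, here is a minimal sketch of lasso coordinate descent along a decreasing lambda sequence (an illustration only, not glmnet's actual implementation; it assumes centred, standardized predictors, and the function names are made up):

soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

lasso_path <- function(x, y, lambdas, tol = 1e-7, max_iter = 1e4) {
  n <- nrow(x); p <- ncol(x)
  beta <- rep(0, p)                      # warm start for the first (largest) lambda
  path <- matrix(0, p, length(lambdas))
  for (k in seq_along(lambdas)) {
    for (it in seq_len(max_iter)) {
      beta_old <- beta
      for (j in seq_len(p)) {
        r_j <- y - x[, -j, drop = FALSE] %*% beta[-j]   # partial residual
        z <- crossprod(x[, j], r_j) / n
        beta[j] <- soft_threshold(z, lambdas[k]) / (sum(x[, j]^2) / n)
      }
      if (max(abs(beta - beta_old)) < tol) break        # convergence check
    }
    path[, k] <- beta    # this solution becomes the warm start for lambdas[k + 1]
  }
  path
}

Because beta is carried over from one lambda to the next, usually only a few coordinate sweeps are needed per lambda, which is what makes fitting a whole path cheap.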

If you run separate glmnet calls for each $\lambda$, it's considerably slower, and indeed the documentation in ?glmnet states the following about the lambda parameter:

WARNING: use with care. Do not supply a single value for lambda [...] Supply instead a decreasing sequence of lambda values. glmnet relies on its warms starts for speed, and its often faster to fit a whole path than compute a single fit.

Emphasis mine. So in the time a non-regularization-path approach computes the solution for one $\lambda$, the regularization-path-based one has already done all of the $\lambda$s and started on the next fold. See also the comment to this answer from Chris Haug. Apparently he has access to MATLAB, which I don't, and his findings seem to confirm my suspicion that the difference in speed comes from the use of the regularization path.
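The effect described in the documentation warning quoted above is easy to reproduce with a small timing sketch (simulated data roughly matching the dimensions in the question; exact timings will of course vary by machine):

library(glmnet)
set.seed(1)
x <- matrix(rnorm(163 * 100), 163, 100)   # 163 observations, 100 predictors
y <- rnorm(163)

system.time(fit_path <- glmnet(x, y))     # one call: the full path, with warm starts

system.time(                              # one call per lambda: warm starts discarded
  fits <- lapply(fit_path$lambda, function(l) glmnet(x, y, lambda = l))
)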