In scikit-learn the implementation of Lasso with coordinate descent tends to be faster than our implementation of LARS, although for small p (such as in your case) they are roughly equivalent (LARS might even be a bit faster with the latest optimizations available in the master repo). Furthermore, coordinate descent allows for an efficient implementation of elastic net regularized problems. This is not the case for LARS (which solves only the Lasso, i.e. L1-penalized problems).
Elastic Net penalization tends to yield a better generalization than Lasso (closer to the solution of ridge regression) while keeping the nice sparsity inducing features of Lasso (supervised feature selection).
For large N (and large p, sparse or not) you might also give stochastic gradient descent (with an L1 or elastic net penalty) a try (also implemented in scikit-learn).
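For instance, a minimal sketch with scikit-learn's SGDRegressor (the data here is synthetic and the parameter values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)
true_coef = np.zeros(10)
true_coef[:3] = [2.0, -1.0, 0.5]            # sparse ground truth
y = X @ true_coef + 0.01 * rng.randn(1000)

# penalty="elasticnet" mixes L1 and L2; l1_ratio controls the mix
sgd = SGDRegressor(penalty="elasticnet", alpha=1e-4, l1_ratio=0.5,
                   max_iter=1000, tol=1e-4, random_state=0)
sgd.fit(X, y)
print(sgd.coef_.round(2))
```

Because SGD processes one sample at a time, it scales well when N is large.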
Edit: here are some benchmarks comparing LassoLARS and the coordinate descent implementation in scikit-learn
Part 1
In the elastic net, two types of constraints on the parameters are employed:
- Lasso constraints (i.e. on the size of the absolute values of $\beta_j$)
- Ridge constraints (i.e. on the size of the squared values of $\beta_j$)
$\alpha$ controls the relative weighting of the two types. The Lasso constraints allow for the selection/removal of variables in the model. The ridge constraints can cope with collinear variables. Which you put more weight upon will depend on the data properties; lots of correlated variables may need both constraints, a few correlated variables might suggest more emphasis on ridge constraints.
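Concretely, the penalty glmnet uses is $\lambda\left[\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2\right]$; a small numpy sketch of how $\alpha$ interpolates between the two constraints:

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty as in glmnet:
    lam * (alpha * ||beta||_1 + (1 - alpha)/2 * ||beta||_2^2).
    alpha=1 gives the Lasso, alpha=0 gives ridge."""
    beta = np.asarray(beta, dtype=float)
    return lam * (alpha * np.abs(beta).sum()
                  + (1.0 - alpha) / 2.0 * np.square(beta).sum())

beta = np.array([1.0, -2.0, 0.0])
print(round(elastic_net_penalty(beta, lam=0.1, alpha=1.0), 3))  # pure Lasso: 0.1 * 3
print(round(elastic_net_penalty(beta, lam=0.1, alpha=0.0), 3))  # pure ridge: 0.1 * 5/2
```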
One way to solve this is to treat $\alpha$ as a tuning parameter alongside $\lambda$, and use the values that give the lowest CV error, in the same way that you are currently tuning over $\lambda$ with cv.glmnet.
The R package caret can build models using the glmnet package and should be set up to tune over both parameters $\alpha$ and $\lambda$.
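In scikit-learn, a rough equivalent of this two-parameter tuning is ElasticNetCV, where l1_ratio plays the role of glmnet's $\alpha$ and alpha (confusingly) plays the role of glmnet's $\lambda$. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.RandomState(1)
X = rng.randn(200, 20)
coef = np.zeros(20)
coef[:4] = [3.0, -2.0, 1.5, 1.0]
y = X @ coef + 0.1 * rng.randn(200)

# CV searches over the grid of (l1_ratio, alpha) pairs,
# i.e. glmnet's (alpha, lambda)
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], n_alphas=50, cv=5)
enet.fit(X, y)
print(enet.l1_ratio_, enet.alpha_)
```

Note the naming clash: scikit-learn's `alpha` is the overall penalty strength, not the L1/L2 mixing weight.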
Part 2
Q3
Yes, in this case where $m \gg n$ (number of variables $\gg$ number of observations), the help page for ?glmnet suggests using type.gaussian = "naive". Instead of storing all the inner products computed along the way, which can be inefficient with a large number of variables or when $m \gg n$, the "naive" option will loop over $n$ each time it needs to compute an inner product.
If you had not specified this argument, glmnet would have chosen "naive" anyway since $m > 500$, but it is better to specify this explicitly in case the defaults and options change in a later version of the package and you are running the code at a future date.
Q4
Short answer: you don't need to specify a high value for nlambda now that you have chosen an optimal value, conditioned on $\alpha = 0.5$. However, if you want to plot the coefficient paths etc., then having a modest set of values of $\lambda$ over the interval results in a much nicer set of paths. The computational burden of computing the entire path, relative to one specific $\lambda$, is not that great; this is the result of a lot of effort to develop algorithms that do this job efficiently. I would just leave nlambda at the default, unless it makes an appreciable difference in compute time.
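In scikit-learn, the analogue of computing the whole grid of $\lambda$ values in one go is lasso_path; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.RandomState(2)
X = rng.randn(100, 8)
coef = np.zeros(8)
coef[:2] = [2.0, -1.0]
y = X @ coef + 0.05 * rng.randn(100)

# Compute the entire regularization path at once
# (the counterpart of glmnet's nlambda grid)
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)
print(alphas.shape, coefs.shape)   # (100,) and (n_features, n_alphas)
```

Each column of `coefs` is the solution at one penalty value, which is exactly what a coefficient-path plot draws.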
Q5
This is a question about parsimony. The lambda.min option refers to the value of $\lambda$ at the lowest CV error. The error at this value of $\lambda$ is the average of the errors over the $k$ folds, and hence this estimate of the error is uncertain. The lambda.1se option represents the value of $\lambda$ in the search that yields a simpler model than the best model (lambda.min), but whose error is within 1 standard error of that of the best model. In other words, using lambda.1se as the selected value of $\lambda$ results in a model that is slightly simpler than the best model but which cannot be distinguished from the best model in terms of error, given the uncertainty in the $k$-fold CV estimate of the error of the best model.
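The one-standard-error rule itself is easy to sketch in numpy (the lambda values and CV errors below are made-up illustrative numbers, not output from glmnet):

```python
import numpy as np

def lambda_1se(lambdas, cv_mean, cv_se):
    """One-standard-error rule: the largest lambda (simplest model) whose
    mean CV error is within one SE of the minimum. Assumes lambdas are
    sorted in decreasing order, as glmnet returns them."""
    lambdas = np.asarray(lambdas)
    cv_mean = np.asarray(cv_mean)
    cv_se = np.asarray(cv_se)
    i_min = np.argmin(cv_mean)                  # index of lambda.min
    threshold = cv_mean[i_min] + cv_se[i_min]   # min error + 1 SE
    return lambdas[cv_mean <= threshold].max()

lambdas = np.array([1.0, 0.5, 0.25, 0.1, 0.05])
cv_mean = np.array([2.0, 1.08, 1.0, 1.05, 1.3])
cv_se   = np.array([0.1, 0.1, 0.1, 0.1, 0.1])
print(lambda_1se(lambdas, cv_mean, cv_se))  # -> 0.5 (vs lambda.min = 0.25)
```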
The choice is yours:
- The best model, which may be too complex or slightly overfitted: lambda.min
- The simplest model that has comparable error to the best model given the uncertainty: lambda.1se
Part 3
This is a simple one and is something you'll come across a lot with R. You use the predict() function 99.9% of the time. R will arrange for the use of the correct function for the object supplied as the first argument.
More technically, predict is a generic function, which has methods (versions of the function) for objects of different types (technically known as classes). The object created by glmnet has a particular class (or classes) depending on what type of model is actually fitted. glmnet (the package) provides methods for the predict function for these different types of objects. R knows about these methods and will choose the appropriate one based on the class of the object supplied.
Best Answer
I found that the intercept in GLMnet is computed after the coefficient updates have converged. The intercept is computed from the mean of the $y_i$'s and the means of the $x_{ij}$'s. The formula is similar to the one I gave previously, but with the $\beta_j$'s taken after the update loop: $\beta_0=\bar{y}-\sum_{j=1}^{p} \hat{\beta}_j \bar{x}_j$.
In Python this gives something like the following, which I found on the scikit-learn page.
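The original snippet is not reproduced here; below is a minimal numpy sketch of the formula above (the function name is my own, not scikit-learn's):

```python
import numpy as np

def intercept_after_update(X, y, beta):
    """Unpenalized intercept computed after the coordinate-descent
    updates have converged: beta_0 = ybar - sum_j beta_j * xbar_j."""
    return y.mean() - X.mean(axis=0) @ beta

rng = np.random.RandomState(3)
X = rng.randn(50, 3) + 5.0           # non-centered features
beta = np.array([1.0, 0.5, -2.0])
y = 4.0 + X @ beta                   # true intercept is 4.0

print(intercept_after_update(X, y, beta))  # ~4.0 (exact here: no noise)
```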
EDIT: the coefficients have to be standardized first:
$\beta_0=\bar{y}-\sum_{j=1}^{p} \frac{\hat{\beta_j} \bar{x_j}}{\sum_{i=1}^{n} x_{ij}^2}$.