In scikit-learn the implementation of Lasso with coordinate descent tends to be faster than our implementation of LARS, although for small p (such as in your case) they are roughly equivalent (LARS might even be a bit faster with the latest optimizations available in the master repo). Furthermore, coordinate descent allows for an efficient implementation of elastic net regularized problems. This is not the case for LARS (which solves only the Lasso, i.e. L1-penalized problems).
Elastic Net penalization tends to yield a better generalization than Lasso (closer to the solution of ridge regression) while keeping the nice sparsity inducing features of Lasso (supervised feature selection).
For large N (and large p, sparse or not) you might also give stochastic gradient descent (with an L1 or elastic net penalty) a try (also implemented in scikit-learn).
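For instance, a minimal sketch with scikit-learn's SGDRegressor (the data here is synthetic and the parameter values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)
true_coef = np.zeros(10)
true_coef[:3] = [2.0, -1.0, 0.5]            # sparse ground truth
y = X @ true_coef + 0.01 * rng.randn(1000)

# penalty="elasticnet" mixes L1 and L2; l1_ratio controls the mix
sgd = SGDRegressor(penalty="elasticnet", alpha=1e-4, l1_ratio=0.5,
                   max_iter=1000, tol=1e-4, random_state=0)
sgd.fit(X, y)
print(sgd.coef_.round(2))
```

Because SGD processes one sample at a time, it scales well when N is large.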
Edit: here are some benchmarks comparing LassoLARS and the coordinate descent implementation in scikit-learn
Part 1
In the elastic net, two types of constraints on the parameters are employed:
- Lasso constraints (i.e. on the size of the absolute values of $\beta_j$)
- Ridge constraints (i.e. on the size of the squared values of $\beta_j$)
$\alpha$ controls the relative weighting of the two types. The Lasso constraints allow for the selection/removal of variables in the model. The ridge constraints can cope with collinear variables. Which you put more weight upon will depend on the data properties; lots of correlated variables may need both constraints, a few correlated variables might suggest more emphasis on ridge constraints.
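Concretely, the penalty glmnet uses is $\lambda\left[\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2\right]$; a small numpy sketch of how $\alpha$ interpolates between the two constraints:

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty as in glmnet:
    lam * (alpha * ||beta||_1 + (1 - alpha)/2 * ||beta||_2^2).
    alpha=1 gives the Lasso, alpha=0 gives ridge."""
    beta = np.asarray(beta, dtype=float)
    return lam * (alpha * np.abs(beta).sum()
                  + (1.0 - alpha) / 2.0 * np.square(beta).sum())

beta = np.array([1.0, -2.0, 0.0])
print(round(elastic_net_penalty(beta, lam=0.1, alpha=1.0), 3))  # pure Lasso: 0.1 * 3
print(round(elastic_net_penalty(beta, lam=0.1, alpha=0.0), 3))  # pure ridge: 0.1 * 5/2
```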
One way to solve this is to treat $\alpha$ as a tuning parameter alongside $\lambda$, and use the values that give the lowest CV error, in the same way that you are currently tuning over $\lambda$ with cv.glmnet.
The R package caret can build models using the glmnet package and should be set up to tune over both parameters $\alpha$ and $\lambda$.
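In scikit-learn, a rough equivalent of this two-parameter tuning is ElasticNetCV, where l1_ratio plays the role of glmnet's $\alpha$ and alpha (confusingly) plays the role of glmnet's $\lambda$. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.RandomState(1)
X = rng.randn(200, 20)
coef = np.zeros(20)
coef[:4] = [3.0, -2.0, 1.5, 1.0]
y = X @ coef + 0.1 * rng.randn(200)

# CV searches over the grid of (l1_ratio, alpha) pairs,
# i.e. glmnet's (alpha, lambda)
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], n_alphas=50, cv=5)
enet.fit(X, y)
print(enet.l1_ratio_, enet.alpha_)
```

Note the naming clash: scikit-learn's `alpha` is the overall penalty strength, not the L1/L2 mixing weight.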
Part 2
Q3
Yes, in this case where $m \gg n$ (number of variables $\gg$ number of observations), the help page for ?glmnet suggests using type.gaussian = "naive". Instead of storing all the inner products computed along the way, which can be inefficient with a large number of variables or when $m \gg n$, the "naive" option will loop over $n$ each time it needs to compute an inner product.
If you had not specified this argument, glmnet would have chosen "naive" anyway since $m > 500$, but it is better to specify this explicitly in case the defaults and options change in a later version of the package and you are running the code at a future date.
Q4
Short answer: you don't need to specify a high value for nlambda now that you have chosen an optimal value, conditioned on $\alpha = 0.5$. However, if you want to plot the coefficient paths etc., then having a modest set of values of $\lambda$ over the interval results in a much nicer set of paths. The computational burden of computing the entire path, relative to one specific $\lambda$, is not that great; this is the result of a lot of effort to develop algorithms that do this job efficiently. I would just leave nlambda at the default, unless it makes an appreciable difference in compute time.
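In scikit-learn, the analogue of computing the whole grid of $\lambda$ values in one go is lasso_path; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.RandomState(2)
X = rng.randn(100, 8)
coef = np.zeros(8)
coef[:2] = [2.0, -1.0]
y = X @ coef + 0.05 * rng.randn(100)

# Compute the entire regularization path at once
# (the counterpart of glmnet's nlambda grid)
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)
print(alphas.shape, coefs.shape)   # (100,) and (n_features, n_alphas)
```

Each column of `coefs` is the solution at one penalty value, which is exactly what a coefficient-path plot draws.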
Q5
This is a question about parsimony. The lambda.min option refers to the value of $\lambda$ at the lowest CV error. The error at this value of $\lambda$ is the average of the errors over the $k$ folds, and hence this estimate of the error is uncertain. The lambda.1se option represents the value of $\lambda$ in the search that yields a simpler model than the best model (lambda.min), but whose error is within 1 standard error of that of the best model. In other words, using lambda.1se as the selected value of $\lambda$ results in a model that is slightly simpler than the best model but which cannot be distinguished from the best model in terms of error, given the uncertainty in the $k$-fold CV estimate of the error of the best model.
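The one-standard-error rule itself is easy to sketch in numpy (the lambda values and CV errors below are made-up illustrative numbers, not output from glmnet):

```python
import numpy as np

def lambda_1se(lambdas, cv_mean, cv_se):
    """One-standard-error rule: the largest lambda (simplest model) whose
    mean CV error is within one SE of the minimum. Assumes lambdas are
    sorted in decreasing order, as glmnet returns them."""
    lambdas = np.asarray(lambdas)
    cv_mean = np.asarray(cv_mean)
    cv_se = np.asarray(cv_se)
    i_min = np.argmin(cv_mean)                  # index of lambda.min
    threshold = cv_mean[i_min] + cv_se[i_min]   # min error + 1 SE
    return lambdas[cv_mean <= threshold].max()

lambdas = np.array([1.0, 0.5, 0.25, 0.1, 0.05])
cv_mean = np.array([2.0, 1.08, 1.0, 1.05, 1.3])
cv_se   = np.array([0.1, 0.1, 0.1, 0.1, 0.1])
print(lambda_1se(lambdas, cv_mean, cv_se))  # -> 0.5 (vs lambda.min = 0.25)
```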
The choice is yours:
- The best model, which may be too complex or slightly overfitted: lambda.min
- The simplest model that has comparable error to the best model given the uncertainty: lambda.1se
Part 3
This is a simple one and is something you'll come across a lot with R. You use the predict() function 99.9% of the time. R will arrange for the use of the correct function for the object supplied as the first argument.
More technically, predict is a generic function, which has methods (versions of the function) for objects of different types (technically known as classes). The object created by glmnet has a particular class (or classes) depending on what type of model is actually fitted. glmnet (the package) provides methods for the predict function for these different types of objects. R knows about these methods and will choose the appropriate one based on the class of the object supplied.
Best Answer
I found that the intercept in GLMnet is computed after the coefficient updates have converged. The intercept is computed from the mean of the $y_i$'s and the means of the $x_{ij}$'s. The formula is similar to the one I gave previously, but with the $\beta_j$'s taken after the update loop: $\beta_0=\bar{y}-\sum_{j=1}^{p} \hat{\beta}_j \bar{x}_j$.
In Python this gives something like the following, which I found on the scikit-learn page.
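The original snippet is not reproduced here; below is a minimal numpy sketch of the formula above (the function name is my own, not scikit-learn's):

```python
import numpy as np

def intercept_after_update(X, y, beta):
    """Unpenalized intercept computed after the coordinate-descent
    updates have converged: beta_0 = ybar - sum_j beta_j * xbar_j."""
    return y.mean() - X.mean(axis=0) @ beta

rng = np.random.RandomState(3)
X = rng.randn(50, 3) + 5.0           # non-centered features
beta = np.array([1.0, 0.5, -2.0])
y = 4.0 + X @ beta                   # true intercept is 4.0

print(intercept_after_update(X, y, beta))  # ~4.0 (exact here: no noise)
```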
EDIT: the coefficients have to be standardized first:
$\beta_0=\bar{y}-\sum_{j=1}^{p} \frac{\hat{\beta_j} \bar{x_j}}{\sum_{i=1}^{n} x_{ij}^2}$.