Solved – Exact definition of Deviance measure in glmnet package, with crossvalidation

cross-validation, deviance, glmnet, lars, lasso

For my current research I'm using the Lasso method via the glmnet package in R on a binomial dependent variable.

In glmnet the optimal lambda is found via cross-validation and the resulting models can be compared with various measures, e.g. misclassification error or deviance.

My question: How exactly is deviance defined in glmnet? How is it calculated?

(In the corresponding paper, "Regularization Paths for Generalized Linear Models via Coordinate Descent" by Friedman et al., I only find this comment on the deviance used in cv.glmnet: "mean deviance (minus twice the log-likelihood on the left-out data)" (p. 17).)

Best Answer

In Friedman, Hastie, and Tibshirani (2010), the deviance of a binomial model, for the purpose of cross-validation, is calculated as

minus twice the log-likelihood on the left-out data (p. 17)

Given that this is the paper cited in the documentation for glmnet (on pp. 2 and 5), that is probably the formula used in the package.

And indeed, in the source code for function cvlognet, the deviance residuals for the response are calculated as

-2*((y==2)*log(predmat)+(y==1)*log(1-predmat))

where predmat is simply

predict(glmnet.object,x,lambda=lambda)

and is passed in from the enclosing cv.glmnet function. I used the source code available on the JStatSoft page for the paper, and I don't know how up-to-date that code is. The code for this package is surprisingly simple and readable; you can always check for yourself by typing glmnet:::cv.glmnet.
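As an illustration of the formula quoted above, here is a minimal Python sketch (not glmnet's own code) of the per-observation binomial deviance and its mean over a held-out fold. It assumes `y` is coded 0/1 rather than with the 1/2 coding seen in the R snippet, and that `p` already holds fitted probabilities for the positive class. The function names and the clipping constant `eps` are illustrative choices, not part of the package.

```python
import math


def binomial_deviance(y, p, eps=1e-15):
    """Per-observation deviance: -2 * log-likelihood of each held-out point.

    y : iterable of 0/1 labels (assumed coding)
    p : iterable of predicted probabilities that y == 1
    eps : clipping constant (an assumption) that keeps log() finite
    """
    out = []
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # avoid log(0)
        out.append(-2.0 * (yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)))
    return out


def mean_deviance(y, p):
    """Average the per-observation deviances over one left-out fold."""
    d = binomial_deviance(y, p)
    return sum(d) / len(d)


# Toy held-out fold: labels and the probabilities a fitted model assigned.
y_holdout = [1, 0, 1, 0]
p_holdout = [0.9, 0.2, 0.6, 0.5]
fold_score = mean_deviance(y_holdout, p_holdout)
```

Averaging such fold scores for each value of lambda would give a cross-validated deviance curve to minimise.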

Related Solutions

Solved – LARS vs coordinate descent for the lasso

In scikit-learn, the implementation of Lasso with coordinate descent tends to be faster than our implementation of LARS, although for small p (such as in your case) they are roughly equivalent (LARS might even be a bit faster with the latest optimizations available in the master repo). Furthermore, coordinate descent allows for an efficient implementation of elastic net regularized problems. This is not the case for LARS, which solves only Lasso (i.e. L1-penalized) problems.

Elastic Net penalization tends to yield a better generalization than Lasso (closer to the solution of ridge regression) while keeping the nice sparsity inducing features of Lasso (supervised feature selection).
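To make the point about coordinate descent concrete, here is a dependency-free Python sketch of cyclic coordinate descent for the elastic net. It is not scikit-learn's implementation. It assumes `X` and `y` are already centered (no intercept is fitted) and uses the objective (1/2n)·||y − Xb||² + lam·(alpha·||b||₁ + (1 − alpha)/2·||b||²); the function names, this parameterization, and the fixed number of sweeps are all illustrative assumptions.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: shrink z toward zero by gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, lam, alpha, n_sweeps=200):
    """Cyclic coordinate descent for the elastic net on centered data.

    X : list of rows (n x p), y : list of n responses.
    alpha = 1 gives the lasso, alpha = 0 gives ridge regression.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the partial residual
            # (current residual with column j's contribution added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta  # keep the residual in sync
                beta[j] = new
    return beta


# Toy centered design with two orthogonal columns.
X_demo = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y_demo = [2.0, -2.0, 0.1, -0.1]
lasso_fit = elastic_net_cd(X_demo, y_demo, lam=0.1, alpha=1.0)
enet_fit = elastic_net_cd(X_demo, y_demo, lam=0.1, alpha=0.5)
```

The only change between the lasso and the elastic net is the extra `lam * (1 - alpha)` term in the denominator, which is why the same loop covers both. On this toy design the lasso fit sets the second coefficient exactly to zero.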

For large N (and large p, sparse or not) you might also give a stochastic gradient descent (with L1 or elastic net penalty) a try (also implemented in scikit-learn).

Edit: here are some benchmarks comparing LassoLARS and the coordinate descent implementation in scikit-learn

Solved – How is the intercept computed in GLMnet

I found that the intercept in GLMnet is computed after the coefficient updates have converged. The intercept is computed from the mean of the $y_i$'s and the means of the $x_{ij}$'s. The formula is similar to the previous one I gave, but with the $\beta_j$'s taken after the update loop: $\beta_0=\bar{y}-\sum_{j=1}^{p} \hat{\beta_j} \bar{x_j}$.

In Python this gives something like:

        self.intercept_ = ymean - np.dot(Xmean, self.coef_.T)

which I found on the scikit-learn page.

EDIT: the coefficients first have to be rescaled by `X_std`:

        self.coef_ = self.coef_ / X_std

$\beta_0=\bar{y}-\sum_{j=1}^{p} \frac{\hat{\beta_j} \bar{x_j}}{\sum_{i=1}^{n} x_{ij}^2}$.
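The two steps above (rescale the coefficients, then recover the intercept from the means) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not GLMnet's or scikit-learn's code: the helper names are hypothetical, and `scale` stands for whatever per-column scaling factors the fit actually used (the role played by `X_std` above).

```python
def rescale_coefficients(coef_scaled, scale):
    """Map coefficients fitted on scaled columns back to the original units."""
    return [b / s for b, s in zip(coef_scaled, scale)]


def recover_intercept(X, y, coef):
    """Intercept from the means: beta_0 = mean(y) - sum_j coef_j * mean(x_j)."""
    n = len(X)
    x_bar = [sum(row[j] for row in X) / n for j in range(len(coef))]
    y_bar = sum(y) / n
    return y_bar - sum(b * m for b, m in zip(coef, x_bar))


# Toy data generated exactly from y = 3 + 2*x1 - 1*x2.
X_demo = [[1.0, 2.0], [2.0, 0.0], [3.0, 5.0], [4.0, 1.0]]
y_demo = [3.0 + 2.0 * x1 - 1.0 * x2 for x1, x2 in X_demo]

# Suppose the slopes were fitted on columns divided by these (assumed) scales.
scale_demo = [2.0, 4.0]
coef_scaled_demo = [4.0, -4.0]

coef_demo = rescale_coefficients(coef_scaled_demo, scale_demo)
intercept_demo = recover_intercept(X_demo, y_demo, coef_demo)
```

Because the intercept is not penalized, it can be set once at the end from the means rather than being updated inside the coordinate loop.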
