Solved – Calculating the misclassification error rate in glmnet (LASSO)


Using glmnet, different metrics can be used to find the optimal value of log(lambda) via cross-validation: for example, the maximum ROC-AUC for classification, or the minimum misclassification error rate.

Let's assume our glmnet model has a binary response (e.g., disease, yes vs. no).
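For instance, a minimal R sketch of choosing between these two metrics with cv.glmnet (the simulated data and object names below are purely illustrative):

```r
library(glmnet)

set.seed(1)
# Simulated data, purely illustrative: 200 observations, 20 predictors,
# binary outcome (disease yes/no)
x <- matrix(rnorm(200 * 20), nrow = 200)
y <- rbinom(200, size = 1, prob = plogis(x[, 1] - x[, 2]))

# Cross-validate on the misclassification error rate ...
cv_class <- cv.glmnet(x, y, family = "binomial", type.measure = "class")

# ... or on the ROC-AUC (here the lambda maximising the AUC is selected)
cv_auc <- cv.glmnet(x, y, family = "binomial", type.measure = "auc")
```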

Different steps in glmnet (a sketch in R follows the list):
(1) Define the coefficient path for the model predictors as a function of log(lambda).
(2) Run k-fold cross-validation to find the optimal log(lambda), i.e., the one with the lowest cross-validated misclassification error rate.
(3) Apply the log(lambda) with the lowest cross-validated misclassification error rate to obtain the coefficients for the markers in the glmnet model.
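
A minimal sketch of these three steps, continuing with the simulated x and y from above (object names are illustrative):

```r
library(glmnet)

# (1) Coefficient path for the predictors as a function of log(lambda)
fit <- glmnet(x, y, family = "binomial")
plot(fit, xvar = "lambda")

# (2) k-fold cross-validation, minimising the misclassification error rate
cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "class", nfolds = 10)
plot(cvfit)                  # CV misclassification error versus log(lambda)
cvfit$lambda.min             # lambda with the lowest cross-validated error

# (3) Coefficients at that lambda
coef(cvfit, s = "lambda.min")
```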

The misclassification error rate is a simple metric based on a confusion matrix. However, it requires a dichotomous variable (predicted disease, yes vs. no), not a continuous one. So, in the glmnet algorithm, how exactly are the predicted classes defined?
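
To make the question concrete: a misclassification error rate can only be computed once the predicted probabilities have been turned into classes by some cut-off. The snippet below assumes a 0.5 cut-off purely for illustration (and reuses cvfit from the sketch above); whether glmnet applies such a rule internally is exactly what is being asked.

```r
# Illustration only: a confusion matrix and misclassification error rate,
# assuming a 0.5 cut-off on the predicted probability of disease
prob_hat  <- predict(cvfit, newx = x, s = "lambda.min", type = "response")
class_hat <- as.integer(prob_hat > 0.5)

table(observed = y, predicted = class_hat)  # confusion matrix
mean(class_hat != y)                        # misclassification error rate
```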

Best Answer

I just ran across this question because I was wondering the same thing, but I think I have it figured out. The misclassification error is a test (out-of-sample) error under a 0/1 loss function, conditional on the data set/model used (with a given threshold). It averages this error over all X and Y, including those not in the data set. In the spirit of cross-validation, however, we are almost always interested in the expected test error, which averages over the distribution of test data sets. The misclassification error you see as output of, for example, the plot of a cv.glmnet object is averaged over every model you could employ with a varying threshold, for every 'dataset' in the cross-validation procedure.
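
As a rough illustration of that averaging, here is a hand-rolled sketch of the per-fold error computation. It continues from the x, y, and cvfit objects in the question's sketches, assumes a 0.5 probability cut-off, and only approximates what cv.glmnet does internally.

```r
library(glmnet)

set.seed(1)
k      <- 10
foldid <- sample(rep(seq_len(k), length.out = nrow(x)))
lam    <- cvfit$lambda.min

fold_err <- sapply(seq_len(k), function(f) {
  train <- foldid != f
  # Fit on the k-1 training folds, then classify the held-out fold
  fit_f <- glmnet(x[train, ], y[train], family = "binomial")
  p_hat <- predict(fit_f, newx = x[!train, , drop = FALSE], s = lam, type = "response")
  mean(as.integer(p_hat > 0.5) != y[!train])
})

mean(fold_err)   # average held-out error, comparable in spirit to min(cvfit$cvm)
```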