Solved – Calculating the misclassification error rate in glmnet (LASSO)


Using glmnet, different metrics can be used to find the optimal value of log(lambda) via cross-validation: for example, the maximum ROC-AUC for classification, or the minimum misclassification error rate.

Let's assume our glmnet model has a binary response (e.g., disease, yes vs. no).
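For instance, a minimal R sketch of choosing between these two metrics with cv.glmnet (the simulated data and object names below are purely illustrative):

```r
library(glmnet)

set.seed(1)
# Simulated data, purely illustrative: 200 observations, 20 predictors,
# binary outcome (disease yes/no)
x <- matrix(rnorm(200 * 20), nrow = 200)
y <- rbinom(200, size = 1, prob = plogis(x[, 1] - x[, 2]))

# Cross-validate on the misclassification error rate ...
cv_class <- cv.glmnet(x, y, family = "binomial", type.measure = "class")

# ... or on the ROC-AUC (here the lambda maximising the AUC is selected)
cv_auc <- cv.glmnet(x, y, family = "binomial", type.measure = "auc")
```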

Different steps in glmnet (a sketch in R follows the list):
(1) Define the coefficient path for the model predictors as a function of log(lambda).
(2) Run k-fold cross-validation to find the optimal log(lambda), i.e., the one with the lowest cross-validated misclassification error rate.
(3) Apply the log(lambda) with the lowest cross-validated misclassification error rate to obtain the coefficients for the markers in the glmnet model.
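
A minimal sketch of these three steps, continuing with the simulated x and y from above (object names are illustrative):

```r
library(glmnet)

# (1) Coefficient path for the predictors as a function of log(lambda)
fit <- glmnet(x, y, family = "binomial")
plot(fit, xvar = "lambda")

# (2) k-fold cross-validation, minimising the misclassification error rate
cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "class", nfolds = 10)
plot(cvfit)                  # CV misclassification error versus log(lambda)
cvfit$lambda.min             # lambda with the lowest cross-validated error

# (3) Coefficients at that lambda
coef(cvfit, s = "lambda.min")
```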

The misclassification error rate is a simple metric based on a confusion matrix. However, it requires a dichotomous variable (predicted disease, yes vs. no), not a continuous one. So, in the glmnet algorithm, how exactly are the predicted classes defined?
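
To make the question concrete: a misclassification error rate can only be computed once the predicted probabilities have been turned into classes by some cut-off. The snippet below assumes a 0.5 cut-off purely for illustration (and reuses cvfit from the sketch above); whether glmnet applies such a rule internally is exactly what is being asked.

```r
# Illustration only: a confusion matrix and misclassification error rate,
# assuming a 0.5 cut-off on the predicted probability of disease
prob_hat  <- predict(cvfit, newx = x, s = "lambda.min", type = "response")
class_hat <- as.integer(prob_hat > 0.5)

table(observed = y, predicted = class_hat)  # confusion matrix
mean(class_hat != y)                        # misclassification error rate
```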

Best Answer

I just ran across this question because I was wondering the same thing, but I think I have it figured out. The misclassification error is a test (out-of-sample) error under a 0/1 loss function, conditional on the data set/model used (with a given threshold). It averages this error over all X and Y, including those not in the data set. In the spirit of cross-validation, however, we are almost always interested in the expected test error, which averages over the distribution of test data sets. The misclassification error you see as output of, for example, the plot of a cv.glmnet object is averaged over every model you could employ with a varying threshold, for every 'dataset' in the cross-validation procedure.
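
As a rough illustration of that averaging, here is a hand-rolled sketch of the per-fold error computation. It continues from the x, y, and cvfit objects in the question's sketches, assumes a 0.5 probability cut-off, and only approximates what cv.glmnet does internally.

```r
library(glmnet)

set.seed(1)
k      <- 10
foldid <- sample(rep(seq_len(k), length.out = nrow(x)))
lam    <- cvfit$lambda.min

fold_err <- sapply(seq_len(k), function(f) {
  train <- foldid != f
  # Fit on the k-1 training folds, then classify the held-out fold
  fit_f <- glmnet(x[train, ], y[train], family = "binomial")
  p_hat <- predict(fit_f, newx = x[!train, , drop = FALSE], s = lam, type = "response")
  mean(as.integer(p_hat > 0.5) != y[!train])
})

mean(fold_err)   # average held-out error, comparable in spirit to min(cvfit$cvm)
```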