Solved – How to report percentage accuracy for glmnet logistic regression

glmnet · logistic · machine learning · r · regression

I am using glmnet with a binary dependent variable (class 0, class 1) and want to report the model's percentage accuracy. When I call the predict function on my test dataset, the returned values are decimals rather than 0s and 1s, so I apply a threshold of 0.5: if a predicted value is > 0.5 I treat it as 1, and if it is <= 0.5 I treat it as 0. I then build a confusion matrix comparing the predicted and actual values of my test data, and compute the accuracy from it. My sample code is below. Is this the right approach for reporting percentage accuracy for a glmnet model with a binary dependent variable?

library(glmnet)

data <- read.csv('datafile', header=TRUE)
mat  <- as.matrix(data)
X    <- mat[, 1:(ncol(mat)-1)]  # parentheses matter: 1:ncol(mat)-1 means (1:ncol(mat))-1, i.e. columns 0..n-1
y    <- mat[, ncol(mat)]
fit  <- cv.glmnet(X, y, family="binomial", type.measure="class", alpha=0.1)

t                             <- as.integer(0.2*nrow(mat)) # 20% of data
testX                         <- mat[1:t, 1:(ncol(mat)-1)] # note: rows 1:t were also used to fit the model above
predicted_y                   <- predict(fit, newx=testX, s=0.01, type='response')
predicted_y[predicted_y>0.5]  <- 1
predicted_y[predicted_y<=0.5] <- 0
Yactual                       <- mat[1:t, ncol(mat)]
confusion_matrix              <- table(Yactual, predicted_y)
accuracy                      <- 100 * sum(diag(confusion_matrix)) / length(predicted_y)

Best Answer

glmnet is designed around a proper accuracy score, the (penalized) deviance. Summaries of predictive discrimination should use proper scoring rules, not arbitrary classifications that are at odds with the costs of false positives and false negatives. Two widely accepted proper scoring rules are the Brier (quadratic) score and the logarithmic (deviance-like) score. The proportion classified correctly, by contrast, can be manipulated in a number of silly ways. The easiest way to see this: if the prevalence of $Y=1$ is 0.98, you can be 98% accurate by ignoring all the data and predicting $Y=1$ for everyone.
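Both proper scores can be computed directly from the predicted probabilities, with no thresholding step at all. A minimal base-R sketch, using made-up probabilities and outcomes for illustration:

```r
# Hypothetical predicted probabilities and observed binary outcomes
p <- c(0.9, 0.8, 0.7, 0.3, 0.2, 0.6)
y <- c(1,   1,   0,   0,   0,   1)

# Brier (quadratic) score: mean squared distance between probability and outcome
brier <- mean((p - y)^2)

# Logarithmic (deviance-like) score: mean negative log-likelihood of the outcomes
logscore <- -mean(y * log(p) + (1 - y) * log(1 - p))
```

Lower is better for both; in expectation each is minimized by the true probabilities, so neither rewards the degenerate "always predict the majority class" strategy that maximizes raw accuracy under high prevalence.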

Another way of saying all this: by changing from an arbitrary cutoff of 0.5 to another arbitrary cutoff, different features will be selected. An improper scoring rule is optimized by a bogus model.