Solved – Using “accuracy” as a measure of performance for logistic regression

Tags: accuracy, auc, logistic, roc

I can't quite wrap my head around something that should be relatively fundamental. I'm familiar with ROC curves and AUC, but what confuses me is using accuracy for logistic regression. My understanding is that a logistic regression model outputs probabilities, and those probabilities are then assigned to one of two classes (assuming a binary dependent variable) based on a threshold value. That makes sense for AUC, since AUC summarizes performance over the entire range of thresholds from 0 to 1. But how can we use "accuracy" to measure a model's performance when the predicted classes depend on the threshold value? I keep seeing it used in Python examples (using sklearn) but don't understand how it works. Using AUC makes sense, but I don't get accuracy.

Shouldn't we get a whole range of accuracy values for each model, one per threshold, the same way we get a whole range of true positive and false positive rates that combine into a single AUC score?
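
For concreteness, here is a minimal sketch (synthetic data and illustrative variable names, not from any particular tutorial) of the sklearn pattern I mean: `accuracy_score` reports a single number because `predict()` silently applies a fixed threshold of 0.5 to the predicted probabilities, while `roc_auc_score` works on the probabilities directly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy binary-classification data (purely illustrative).
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]  # P(class 1) for each test point
labels = model.predict(X_test)             # same as thresholding proba at 0.5

print("accuracy at the implicit 0.5 threshold:", accuracy_score(y_test, labels))
print("AUC, which needs no threshold:", roc_auc_score(y_test, proba))
```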

Best Answer

As @MatthewDrury writes, this really isn't something people should typically be doing. People do it, to be blunt, because they don't know what they're doing and they are mechanically mimicking what they have seen others do. For more information about this topic, see my answer here: RMSE (Root Mean Squared Error) for logistic models.

To answer your explicit question: the accuracy of a model's classifications is contingent on both the model and the threshold. A better model, or the same model with a different threshold, may yield better accuracy. In essence, people are saying, 'here is the accuracy we will get if we use this model with this threshold.'
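
To make that concrete, here is a minimal sketch (synthetic data via sklearn's `make_classification`; the variable names are just for illustration): sweeping the threshold yields a whole curve of accuracy values for one fitted model, exactly as the question suspects, while the AUC is a single threshold-free number.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predicted probabilities from one fitted model.
proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# One accuracy value per threshold -- the "whole range" the question asks about.
for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    acc = accuracy_score(y_test, (proba >= t).astype(int))
    print(f"threshold={t:.1f}  accuracy={acc:.3f}")

# AUC integrates over all thresholds, so it is a single number per model.
print(f"AUC={roc_auc_score(y_test, proba):.3f}")
```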