Solved – Classification with confidence scores: is regression OK?

classification, machine learning

Say you have a binary classification problem, but you'd like a sense of how confident the classifier is: a numerical score, with a threshold applied afterwards for the binarization.

This can be done by reaching into the model and using its scoring function, but that's not always available in library implementations.

Many ML libraries have regression equivalents of their classifiers, like SVR or Random Forest Regressors.

Is it ok to use regression on labels as a proxy for a classifier confidence score? Are they different?

By this I mean, say I have $n$ samples $X_1, \ldots, X_n$ and class labels $y_1, \ldots, y_n$ with $y_i \in \{0,1\}$. I could train a binary classifier on these, or I could pretend that $y_i \in \mathbb{R}$ and the labels just happen to be 1.0 or 0.0.

If I train a regressor on this formulation, what is wrong with using the outputs as surrogates for classifier scores, and then evaluating the model with ROC and AUC?
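
To make this concrete, here's a minimal sketch of the setup I have in mind (assuming scikit-learn; the RandomForestRegressor and the synthetic data are just placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Binary labels, but treated as the real numbers 0.0 and 1.0.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train.astype(float))

# Regression outputs used as surrogate confidence scores ...
scores = reg.predict(X_test)
print("AUC of regression scores:", roc_auc_score(y_test, scores))

# ... and then a threshold for the binarization.
hard_labels = (scores >= 0.5).astype(int)
```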

Best Answer

In machine learning, you can get away with many approximations if you can show they are useful. There are some questions and answers on this site stating that, in some cases, plain linear regression can stand in for classification without the extensions and adaptations that logistic regression adds for exactly that purpose.
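
As a rough illustration of that point, here is a sketch (assuming scikit-learn and synthetic data) that uses plain linear regression outputs as ranking scores and compares the resulting AUC against logistic regression:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear regression on 0/1 labels: outputs are unbounded scores, yet
# their ranking can still separate the classes.
lin = LinearRegression().fit(X_train, y_train)
auc_lin = roc_auc_score(y_test, lin.predict(X_test))

# Logistic regression: scores are proper probabilities in [0, 1].
log = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc_log = roc_auc_score(y_test, log.predict_proba(X_test)[:, 1])

print(f"linear regression AUC:   {auc_lin:.3f}")
print(f"logistic regression AUC: {auc_log:.3f}")
```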

In the case of random forests, note that they already come with approximate label confidences: the proportion of trees that classify the record with that label.
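
For instance, with scikit-learn (an assumption; its RandomForestClassifier averages each tree's leaf class fractions, which coincides with the voting proportion when trees are fully grown), you can read these confidences off directly, or reproduce the vote count by hand:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Built-in confidences: the average of each tree's leaf class fractions.
proba = clf.predict_proba(X)[:, 1]

# The voting proportion computed explicitly: the fraction of trees
# that predict class 1 for each record.
votes = np.mean([tree.predict(X) for tree in clf.estimators_], axis=0)
print(np.allclose(proba, votes))  # True when the leaves are pure
```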

But who knows, perhaps you are on to something; you would just need to prove it empirically. When you claim that your way provides better label confidences, you are saying that, across the spectrum of possible classification cutoffs, it should reach better classification results than the usual way of estimating label confidences. You could compare the two with ROC AUC, because sweeping over all possible cutoffs is exactly what a ROC curve represents.
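
One possible sketch of that comparison (again assuming scikit-learn and placeholder data): fit both formulations on the same split and compare their held-out ROC AUC.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Your way: a regressor trained on 0.0/1.0 labels, outputs as scores.
reg = RandomForestRegressor(random_state=0).fit(X_train, y_train.astype(float))
auc_reg = roc_auc_score(y_test, reg.predict(X_test))

# The usual way: the classifier's own label confidences.
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
auc_clf = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print(f"regressor-as-scorer AUC:      {auc_reg:.3f}")
print(f"classifier predict_proba AUC: {auc_clf:.3f}")
```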
