What does a high AUC score but poor F1 indicate for an imbalanced dataset?

classification, machine-learning, probability, roc, unbalanced-classes

I am working on a binary classification problem with an imbalanced dataset of 977 records (77:23 class ratio). My label 1 ("POS not met") is the minority class.

Currently, without any over-/under-sampling techniques (as they are not recommended), I get the performance below (using the `class_weight` parameter, though):

Confusion Matrix - Train dataset

And my roc_auc score is 0.8024156371012354

Based on the above results, it feels to me like my model still doesn't perform well on "POS not met", which is our label 1.

However, my AUC is 0.80, which feels like a decent figure.

Now my questions are as follows:

a) Irrespective of the business decision to keep or reject the model, based on the above metrics alone, how do I know how well my model is performing?
I read that AUC describes the discriminative ability between the positive and negative classes. **Does it mean my model's decision threshold should be 0.8?** While my model is good at discriminating between positive and negative, it is bad at identifying "not met" as "not met" (recall is only 60%). My dataset is imbalanced, though. Does AUC still apply?

b) Is my dataset imbalanced in the first place?
What is considered imbalanced: 1:99, 10:90, 20:80, etc.?
Is there any measure (like a correlation coefficient) that can indicate the level of imbalance?

c) Based on the above matrix, how should I interpret the F1-score, recall, and AUC together?
What does a high AUC but a poor F1 mean?

Update

I used the code below (found online) to get the best F1 at different thresholds.

[Image: F1 scores at different decision thresholds]
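Roughly, it amounts to something like the sketch below (assuming scikit-learn, with `y_true` and `y_prob` standing in for my training labels and predicted class-1 probabilities):

import numpy as np
from sklearn.metrics import precision_recall_curve

# y_true: true 0/1 labels, y_prob: predicted probabilities for class 1
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# F1 at every candidate threshold (precision/recall carry one extra trailing value)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = np.argmax(f1)
print(f"best threshold = {thresholds[best]:.3f}, best F1 = {f1[best]:.3f}")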

Best Answer

Out of context, it’s really hard to say how good performance is. While your AUC around $0.8$ could be quite good, it could be that your performance is rather pedestrian or that even a value of $0.55$ is excellent.

A key point to remember for the $F1$ score is that it requires a threshold, while AUC is calculated over all thresholds, and your software is using a default threshold of $0.5$ that might be wildly inappropriate. You might find it informative to write a bit of code that calculates the $F1$ over a range of thresholds, something like:

from sklearn.metrics import f1_score

# y_prob: predicted probabilities for class 1, y_true: true labels
for threshold in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    y_pred = (y_prob >= threshold).astype(int)  # map probability outputs to categories
    print(threshold, f1_score(y_true, y_pred))  # calculate F1

I suspect you will find a better $F1$ score at a different threshold. With a few more lines, you can plot the $F1$ as a function of the threshold, which you might find useful.
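For instance (again assuming `y_true` and `y_prob` hold your labels and predicted probabilities, and that matplotlib is available):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score

# Sweep thresholds and record the F1 at each one
thresholds = np.linspace(0.05, 0.95, 19)
f1_scores = [f1_score(y_true, (y_prob >= t).astype(int)) for t in thresholds]

plt.plot(thresholds, f1_scores, marker="o")
plt.xlabel("decision threshold")
plt.ylabel("F1 score")
plt.show()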

This question at least alludes to the decision-making aspect of the problem, too, where a hard decision about a category must be made. I, along with plenty of high-reputation members here, would argue that, unless you know what you gain from correct classifications and what you lose from incorrect classifications, you have no business making hard categorical predictions and should be predicting probabilities. Nonetheless, the code exercise above should show that you can tweak the threshold as needed to make hard classifications once the gains and losses are known.
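As a sketch of that last point, with made-up costs (the numbers below are purely illustrative, not from the question):

# Hypothetical costs: say a missed "POS not met" (false negative) is five times
# as costly as a false alarm (false positive); correct decisions cost nothing.
cost_fp, cost_fn = 1.0, 5.0

# Predicting class 1 is the cheaper bet whenever p * cost_fn >= (1 - p) * cost_fp,
# i.e. whenever p >= cost_fp / (cost_fp + cost_fn).
threshold = cost_fp / (cost_fp + cost_fn)  # 1 / 6, about 0.17 here
y_pred = (y_prob >= threshold).astype(int)

With equal costs this recovers the usual 0.5 threshold; heavier false-negative costs push the threshold down, catching more of the minority class at the expense of more false alarms.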

I will close with some of the usual links I post on this topic. Having seen several of your posts here and on Data Science, I highly recommend Frank Harrell’s blog posts.

Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?

https://www.fharrell.com/post/class-damage/

https://www.fharrell.com/post/classification/

https://stats.stackexchange.com/a/359936/247274

Proper scoring rule when there is a decision to make (e.g. spam vs ham email)
