Solved – logloss vs gini/auc

Tags: auc, gini, log-loss, model-selection, validation

I've trained two models (binary classifiers built with H2O AutoML) and I want to select one to use. I have the following results:

model_id     auc       logloss   logloss_train  logloss_valid  gini_train  gini_valid
DL_grid_1    0.542694  0.287469  0.092717       0.211956       0.872932    0.312975
DL_grid_2    0.543685  0.251431  0.082616       0.186196       0.900955    0.312662

The auc and logloss columns are cross-validation metrics (the cross-validation uses only the training data). The ..._train and ..._valid metrics come from scoring the training and validation sets with each model, respectively. I want to use either logloss_valid or gini_valid to choose the best model.

Model 1 has a better Gini (i.e. a better AUC), but Model 2 has a better logloss. My question is which one to choose, which raises the broader question: what are the advantages and disadvantages of using Gini (AUC) versus logloss as a selection metric?

Best Answer

Whereas AUC is computed from binary classifications across all possible decision thresholds, logloss takes the "certainty" of each classification into account.

Therefore, to my understanding, logloss conceptually goes beyond AUC and is especially relevant with imbalanced data or with unequally distributed error costs (for example, detecting a deadly disease).
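To make this concrete, here is a minimal sketch (using scikit-learn and made-up probabilities, not the H2O models from the question). Two classifiers that rank every case identically receive the same AUC, but the one that hedges its probabilities toward 0.5 is penalized heavily by logloss:

    import numpy as np
    from sklearn.metrics import roc_auc_score, log_loss

    y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])

    # Both score vectors produce the same ordering of cases (one negative,
    # the fifth case, outscores one positive, the fourth case, in each)...
    p_confident = np.array([0.05, 0.10, 0.15, 0.60, 0.70, 0.80, 0.90, 0.95])
    p_hedged    = np.array([0.40, 0.42, 0.44, 0.50, 0.52, 0.54, 0.56, 0.58])

    # ...so the ranking metric cannot tell them apart:
    print(roc_auc_score(y_true, p_confident))  # 0.9375
    print(roc_auc_score(y_true, p_hedged))     # 0.9375

    # ...but logloss rewards well-calibrated confidence:
    print(log_loss(y_true, p_confident))  # ~0.30
    print(log_loss(y_true, p_hedged))     # ~0.60

Since Gini is just 2*AUC - 1, the same argument applies to the gini_valid column in the question.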

In addition to this very basic answer, you might want to have a look at optimizing auc vs logloss in binary classification problems.

A simple example of logloss computation and the underlying concept is discussed in this recent question: Log Loss function in scikit-learn returns different values.
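For completeness, here is a minimal sketch of that computation, following the definition logloss = -(1/N) * sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ] and checking it against scikit-learn (the labels and probabilities are made up):

    import numpy as np
    from sklearn.metrics import log_loss

    y_true = np.array([1, 0, 1, 1, 0])
    p_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.4])

    # Manual computation straight from the definition:
    manual = -np.mean(y_true * np.log(p_pred)
                      + (1 - y_true) * np.log(1 - p_pred))
    print(manual)                    # ~0.3414
    print(log_loss(y_true, p_pred))  # same value

Note how the two confidently correct predictions (0.9 and 0.2) contribute little to the loss, while the lukewarm ones (0.6 and 0.4) dominate it.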

In addition, a very good point has been made on Stack Overflow:

One must understand the crucial difference between ROC AUC and "point-wise" metrics like accuracy/precision etc. The ROC is a function of the threshold. Given a model (classifier) that outputs the probability of belonging to each class, we usually classify an element to the class with the highest support. However, sometimes we can get better scores by changing this rule and requiring one support to be, say, two times bigger than the other to actually classify as a given class. This is often true for imbalanced datasets. This way you are actually modifying the learned prior of the classes to better fit your data. ROC looks at "what would happen if I changed this threshold to all possible values", and ROC AUC computes the integral of such a curve.
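The following sketch illustrates that point on synthetic imbalanced data (roughly 10% positives; all numbers are made up for illustration). The AUC is a single threshold-free number, while a point-wise metric like accuracy changes as the decision threshold moves, and the default 0.5 need not be the best choice:

    import numpy as np
    from sklearn.metrics import accuracy_score, roc_auc_score

    rng = np.random.default_rng(0)
    y_true = (rng.random(1000) < 0.10).astype(int)  # ~10% positives

    # Informative but noisy scores: positives score higher on average.
    scores = np.clip(0.3 * y_true + rng.normal(0.3, 0.15, 1000), 0, 1)

    # One number, independent of any threshold:
    print("AUC:", roc_auc_score(y_true, scores))

    # Accuracy depends on where we cut; on imbalanced data the best
    # cutoff is often not 0.5:
    for t in [0.3, 0.4, 0.5, 0.6]:
        acc = accuracy_score(y_true, (scores >= t).astype(int))
        print(f"threshold={t:.1f}  accuracy={acc:.3f}")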