Loss Functions – Choosing the Best Loss Function for Binary Classification

loss-functions

I work in a problem domain where people often report ROC-AUC or AveP (average precision). However, I recently found papers that optimize Log Loss instead, while still others report Hinge Loss.

While I understand how these metrics are calculated, I am having a hard time understanding the trade-offs between them and which is good for what exactly.

When it comes to ROC-AUC vs Precision-Recall, this thread discusses how maximizing ROC-AUC can be seen as using an optimization criterion that penalizes "ranking a true negative at least as high as a true positive" (assuming that higher scores correspond to positives). This other thread also provides a helpful discussion of ROC-AUC in contrast to Precision-Recall metrics.

However, for what types of problems would Log Loss be preferred over, say, ROC-AUC, AveP, or Hinge Loss? Most importantly, what questions should one ask about the problem when choosing between these loss functions for binary classification?
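For concreteness, here is a minimal sketch of how all four quantities can be computed on the same set of scores, assuming scikit-learn is available; the toy data and the logistic model are illustrative placeholders, not taken from any of the papers above:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (log_loss, hinge_loss,
                                 roc_auc_score, average_precision_score)

    # Toy data and model, purely for illustration.
    X, y = make_classification(n_samples=500, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    proba = clf.predict_proba(X)[:, 1]   # event probabilities, used by Log Loss
    margin = clf.decision_function(X)    # real-valued scores, used by the others

    print("Log Loss :", log_loss(y, proba))
    print("Hinge    :", hinge_loss(y, margin))  # {0, 1} labels are re-encoded to {-1, +1} internally
    print("ROC-AUC  :", roc_auc_score(y, margin))
    print("AveP     :", average_precision_score(y, margin))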

Best Answer

The state-of-the-art reference on the matter is [1]. Essentially, it shows that classifiers minimizing any of the loss functions you mention converge to the Bayes classifier, with fast rates.

Choosing between these for finite samples can be driven by several different arguments:

  1. If you want to recover event probabilities (and not only classifications), then the logistic log-loss, or any other binary-response generalized linear model (probit regression, complementary log-log regression, ...), is a natural candidate.
  2. If you are aiming only at classification, an SVM may be the preferred choice, since it targets only the observations at the classification boundary and ignores distant observations, thus reducing the impact of how truthful the assumed linear model is (see the first sketch after this list).
  3. If you do not have many observations, the advantage in 2 may become a disadvantage: ignoring all but the boundary observations discards information you can ill afford to lose.
  4. There may be computational differences, both in the stated optimization problem and in the particular implementation you are using.
  5. Bottom line: you can simply try them all, e.g., by cross-validation, and pick the best performer (see the second sketch after this list).
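To illustrate points 1 and 2, here is a hedged sketch (again assuming scikit-learn, with illustrative toy data): the logistic model exposes event probabilities directly, while the fitted SVM's decision rule depends only on the support vectors that sit at or near the boundary:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)

    # Point 1: the log-loss model recovers event probabilities.
    logit = LogisticRegression(max_iter=1000).fit(X, y)
    print("P(y=1|x) for the first 3 points:", logit.predict_proba(X[:3])[:, 1])

    # Point 2: the SVM's fit is determined only by its support vectors,
    # i.e. the observations at or near the classification boundary.
    svm = SVC(kernel="linear").fit(X, y)
    print("fraction of observations that are support vectors:",
          len(svm.support_) / len(y))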
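And a minimal sketch of point 5, under the assumption that cross-validated ROC-AUC is an acceptable criterion for picking the "best performer" (the candidate models and data are placeholders):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=500, random_state=0)
    candidates = {
        "log-loss (logistic regression)": LogisticRegression(max_iter=1000),
        "hinge loss (linear SVM)": LinearSVC(max_iter=10000),
    }
    for name, model in candidates.items():
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        print(f"{name}: mean cross-validated ROC-AUC = {auc.mean():.3f}")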

[1] Bartlett, Peter L., Michael I. Jordan, and Jon D. McAuliffe. "Convexity, Classification, and Risk Bounds." Journal of the American Statistical Association 101, no. 473 (March 2006): 138–56. doi:10.1198/016214505000000907.