Classification – Impacts of Choosing Different Loss Functions

classification, loss-functions, machine-learning, optimization

We know that some objective functions are easier to optimize than others, and that many loss functions we would like to use directly are hard to optimize, the 0-1 loss being the classic example. So we find proxy loss functions to do the work. For example, we use the hinge loss or the logistic loss to "approximate" the 0-1 loss.

The following plot comes from Chris Bishop's PRML book. The hinge loss is plotted in blue, the log loss in red, the square loss in green and the 0/1 error in black.

[Figure: hinge, log, square and 0/1 losses plotted against the margin y f(x)]
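For reference, here is a minimal matplotlib sketch (my own, not the book's code) that reproduces the shape of these curves as functions of the margin y f(x); the log loss is rescaled by 1/ln(2) so that it passes through (0, 1) as in PRML:

```python
import numpy as np
import matplotlib.pyplot as plt

# Margin z = y * f(x): positive means correctly classified.
z = np.linspace(-2, 2, 400)

zero_one = (z < 0).astype(float)       # 0/1 error
hinge    = np.maximum(0.0, 1.0 - z)    # hinge loss
logistic = np.log2(1.0 + np.exp(-z))   # log loss, scaled by 1/ln(2) to pass through (0, 1)
square   = (1.0 - z) ** 2              # squared loss on +/-1 targets

plt.plot(z, zero_one, "k", label="0/1 error")
plt.plot(z, hinge, "b", label="hinge")
plt.plot(z, logistic, "r", label="log")
plt.plot(z, square, "g", label="square")
plt.xlabel("y f(x)")
plt.ylabel("loss")
plt.ylim(0, 4)
plt.legend()
plt.show()
```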

I understand that the reason for such a design (for hinge and logistic loss) is that we want the objective function to be convex.

Looking at the hinge loss and the logistic loss, they penalize strongly misclassified instances more heavily, and, interestingly, they also penalize correctly classified instances that are only weakly classified. It is a really strange design.

My question is: what prices do we pay for using different "proxy loss functions", such as hinge loss and logistic loss?

Best Answer

Some of my thoughts; they may not all be correct.

I understand that the reason for such a design (for hinge and logistic loss) is that we want the objective function to be convex.

Convexity is surely a nice property, but I think the more important reason is that we want the objective function to have non-zero derivatives (almost everywhere), so that we can use those derivatives to solve it. The objective function can even be non-convex, in which case we often just stop at a local optimum or a saddle point.
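To make this concrete, here is a small sketch (my own toy example) of the (sub)gradients of the three losses for a linear model: the 0/1 loss gives a zero gradient almost everywhere, so a gradient-based solver gets no signal from it, while the hinge and logistic losses do.

```python
import numpy as np

def loss_gradients(w, X, y):
    """(Sub)gradients of three losses for a linear score f(x) = X @ w, labels y in {-1, +1}."""
    m = y * (X @ w)                                       # margins
    g_zero_one = np.zeros_like(w)                         # 0/1 loss: zero gradient almost everywhere
    g_hinge    = -(y[:, None] * X)[m < 1].sum(axis=0)     # subgradient of sum(max(0, 1 - m))
    g_logistic = -(y[:, None] * X / (1 + np.exp(m))[:, None]).sum(axis=0)  # gradient of sum(log(1 + exp(-m)))
    return g_zero_one, g_hinge, g_logistic

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=50))
w = np.array([0.1, -0.2])

g01, gh, gl = loss_gradients(w, X, y)
print(g01)  # all zeros: no descent direction to follow
print(gh)   # informative subgradient
print(gl)   # informative gradient
```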

and, interestingly, they also penalize correctly classified instances that are only weakly classified. It is a really strange design.

I think such a design advises the model not only to make the right predictions, but also to be confident about them. If we don't want correctly classified instances to be punished, we can, for example, move the hinge loss (blue) to the left by 1, so that they no longer incur any loss. But I believe this often leads to worse results in practice.
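As a small illustration of that shift (my own sketch; shifting the hinge left by 1 gives what is usually called the perceptron criterion):

```python
import numpy as np

def hinge(margin):
    # Standard hinge loss: still penalizes correct predictions with margin < 1.
    return np.maximum(0.0, 1.0 - margin)

def shifted_hinge(margin):
    # Hinge loss moved left by 1 (the perceptron criterion):
    # any correctly classified point (margin > 0) incurs zero loss.
    return np.maximum(0.0, -margin)

margins = np.array([-1.5, -0.2, 0.1, 0.9, 2.0])   # y * f(x) for five example points
print(hinge(margins))          # the weakly classified points (margins 0.1, 0.9) still get some loss
print(shifted_hinge(margins))  # every correctly classified point gets exactly zero loss
```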

what prices do we pay for using different "proxy loss functions", such as hinge loss and logistic loss?

IMO, by choosing different loss functions we bring different assumptions into the model. For example, minimizing the logistic regression loss (red) corresponds to maximum likelihood under a Bernoulli distribution, while minimizing the MSE loss (green) corresponds to maximum likelihood under Gaussian noise.
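To spell that correspondence out, here is a small sketch (my own, with the noise variance fixed to 1 for simplicity): the negative log-likelihoods of the two distributions are, up to constants, exactly the squared loss and the cross-entropy / log loss.

```python
import numpy as np

def gaussian_nll(y, y_hat, sigma=1.0):
    # -log N(y | y_hat, sigma^2), dropping the constant term:
    # minimizing this over y_hat is exactly minimizing squared error.
    return 0.5 * (y - y_hat) ** 2 / sigma**2

def bernoulli_nll(y, p):
    # -log Bernoulli(y | p) with y in {0, 1}: the cross-entropy / log loss.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(gaussian_nll(np.array([1.0, 0.0]), np.array([0.8, 0.3])))
print(bernoulli_nll(np.array([1.0, 0.0]), np.array([0.8, 0.3])))
```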


Following the least squares vs. logistic regression example in PRML, I added the hinge loss for comparison.

[Figure: decision boundaries fitted with least squares, logistic regression and hinge loss]

As shown in the figure, the hinge loss and logistic regression / cross entropy / log-likelihood / softplus give very close results, because their objective functions are close (figure below), while MSE is generally more sensitive to outliers. The hinge loss does not always have a unique solution because it is not strictly convex.

[Figure: hinge loss and log loss plotted together, showing how close the two objectives are]
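Here is a rough sketch of that comparison with scikit-learn, on toy data of my own rather than the book's: the least-squares boundary gets pulled around by distant, correctly labeled points, while the hinge and logistic boundaries stay close to each other.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Toy two-class data plus a cluster of distant but correctly labeled points,
# roughly in the spirit of the PRML example (not the book's actual data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2, 0], scale=0.7, size=(50, 2)),   # class -1
               rng.normal(loc=[2, 0], scale=0.7, size=(50, 2)),    # class +1
               rng.normal(loc=[8, 6], scale=0.5, size=(10, 2))])   # far-away +1 points
y = np.array([-1] * 50 + [1] * 60)

# Least squares "classifier": regress the +/-1 labels and threshold at 0.
Xb = np.hstack([X, np.ones((len(X), 1))])
w_ls, *_ = np.linalg.lstsq(Xb, y.astype(float), rcond=None)

svm = SVC(kernel="linear", C=1.0).fit(X, y)          # hinge loss (+ L2 regularization)
logreg = LogisticRegression(C=1.0).fit(X, y)         # logistic loss (+ L2 regularization)

print("least squares:", w_ls)
print("hinge (SVM):  ", svm.coef_, svm.intercept_)
print("logistic:     ", logreg.coef_, logreg.intercept_)
# The SVM and logistic boundaries stay close to each other, while the
# least-squares boundary is tilted by the distant (but correct) points.
```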

However, one important property of the hinge loss is that data points far away from the decision boundary contribute nothing to the loss, so the solution is the same with those points removed.

The remaining points are called support vectors in the context of SVM, and SVM additionally uses a regularizer term to ensure the maximum-margin property and a unique solution.
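A quick way to see this with scikit-learn (my own toy check, using the standard SVC API): refitting on the support vectors alone recovers essentially the same hyperplane as fitting on all the data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2, 0], size=(100, 2)),
               rng.normal(loc=[2, 0], size=(100, 2))])
y = np.array([-1] * 100 + [1] * 100)

# Fit a linear SVM, then refit using only its support vectors.
svm_full = SVC(kernel="linear", C=1.0).fit(X, y)
sv = svm_full.support_                                   # indices of the support vectors
svm_sv = SVC(kernel="linear", C=1.0).fit(X[sv], y[sv])

print(len(sv), "support vectors out of", len(X), "points")
print(svm_full.coef_, svm_full.intercept_)
print(svm_sv.coef_, svm_sv.intercept_)   # same hyperplane, up to numerical tolerance
```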