Solved – Does it make sense to use `logit` or `softplus` loss for a binary classification problem?

classification, cross-entropy, loss-functions, neural-networks, sigmoid-curve

Let $z$ be the logit and $p \in \{0, 1\}$ the class label. A binary classification problem usually uses the sigmoid and cross-entropy to compute the loss:
$$\mathcal{L}_1 = -\sum \left[ p \log \sigma(z) + (1-p) \log (1-\sigma(z)) \right]$$

Now suppose we rescale the label to $y = 2p - 1 \in \{-1, 1\}$.

Can we just directly push the logit up when the class is 1 and down when it is -1 with this loss?
$$\mathcal{L}_2 = -\sum y z$$

I have seen some code use a softplus loss like this:
$$\mathcal{L}_3 = \sum \log(1 + e^{-y z})$$
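
For concreteness, here is a minimal NumPy sketch of the three losses as I understand them (my own illustration, not taken from the code I saw; the variable names are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([2.0, -1.0, 0.5])   # logits
p = np.array([1.0, 0.0, 1.0])    # labels in {0, 1}
y = 2 * p - 1                    # rescaled labels in {-1, 1}

# L1: sigmoid + cross-entropy
L1 = -np.sum(p * np.log(sigmoid(z)) + (1 - p) * np.log(1 - sigmoid(z)))

# L2: directly pushing the logit up/down
L2 = -np.sum(y * z)

# L3: softplus loss
L3 = np.sum(np.log(1 + np.exp(-y * z)))

print(L1, L2, L3)  # L1 and L3 print the same value; L2 differs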

Does it make sense to use these "simpler" losses? Are there any mathematical reasons against using them?

Could you point me to any source (paper, book, blog…) using these "simpler" losses?

Best Answer

The loss function measures the "goodness" or "fitness" of the model to the data. Moreover, in the context of optimization by gradient descent, the loss function must provide gradient information for the optimizer. The gradients must also be stable and informative, that is, neither exploding, vanishing, nor saturating.

From a statistical learning point of view, a loss function can be built from the principle of maximum likelihood estimation (MLE), resulting in the negative log-likelihood (NLL) loss, which is equivalent to the cross-entropy loss. The specific form of the loss function depends on how the predicted probability is modeled. For binary classification, the probability is modeled by the sigmoid function, yielding the form $\mathcal{L}_1$.
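Concretely, model the label as a Bernoulli random variable with success probability $\sigma(z)$:
$$P(p \mid z) = \sigma(z)^{p}\,\bigl(1-\sigma(z)\bigr)^{1-p}.$$
The negative log-likelihood of a single example is then
$$-\log P(p \mid z) = -\bigl[p \log \sigma(z) + (1-p)\log(1-\sigma(z))\bigr],$$
and summing over the data recovers $\mathcal{L}_1$ exactly.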

Directly pushing logits up and down with an unbounded loss function like $\mathcal{L}_2$ is bad because there is no stopping threshold in the optimization: a correct prediction affects the loss in the same way as a wrong one, so the optimizer keeps pushing the logits toward infinity. By contrast, with the softplus loss $\mathcal{L}_3$, predictions that are already correct contribute less to the loss than wrong ones.
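To see the missing stopping threshold explicitly, compare the per-example gradients with respect to the logit:
$$\frac{\partial}{\partial z}\left(-y z\right) = -y, \qquad \frac{\partial}{\partial z}\log\!\left(1 + e^{-y z}\right) = -y\,\sigma(-y z).$$
For $\mathcal{L}_2$ the gradient has constant magnitude 1 no matter how confidently correct the prediction already is, whereas for $\mathcal{L}_3$ the factor $\sigma(-yz)$ shrinks to 0 as $yz \to +\infty$, so confidently correct examples stop contributing.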

$\mathcal{L}_3$ is actually equivalent to $\mathcal{L}_1$. The first hint is that their gradients with respect to the logit are identical, so their optimization dynamics are the same. In fact, one can expand the formulas and derive one from the other.
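
The derivation is short. Using $\sigma(z) = 1/(1+e^{-z})$ and $1 - \sigma(z) = \sigma(-z)$:
$$-\log \sigma(z) = \log\!\left(1 + e^{-z}\right) \quad \text{(the } p=1,\ y=1 \text{ term)},$$
$$-\log\bigl(1 - \sigma(z)\bigr) = -\log \sigma(-z) = \log\!\left(1 + e^{z}\right) \quad \text{(the } p=0,\ y=-1 \text{ term)}.$$
Both cases collapse into the single expression $\log(1 + e^{-yz})$, so $\mathcal{L}_1$ and $\mathcal{L}_3$ agree term by term.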