Solved – softmax+cross entropy compared with squared regularized hinge loss for CNNs

cross entropy, loss-functions, machine learning, svm

An SVM is actually a single-layer neural network with an identity activation and a squared regularized hinge loss, and it can be optimized with gradients. In addition, the squared regularized hinge loss can be transformed into its dual form to introduce kernels and identify the support vectors.
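To make that concrete, here is a minimal sketch (mine, not part of the original question) of such a linear "SVM layer" trained with squared hinge loss plus L2 regularization by plain gradient descent; it assumes PyTorch and toy synthetic data with labels in {-1, +1}.

```python
import torch

torch.manual_seed(0)
X = torch.randn(200, 2)                      # toy 2-D features
y = (X.sum(dim=1) > 0).float() * 2 - 1       # toy labels in {-1, +1}

# Single linear layer with identity activation -- i.e. a linear SVM.
w = torch.zeros(2, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lam = 1e-2                                   # L2 regularization strength (assumed value)
opt = torch.optim.SGD([w, b], lr=0.1)

for _ in range(200):
    margin = y * (X @ w + b)                                          # y * f(x)
    loss = torch.clamp(1 - margin, min=0).pow(2).mean() + lam * w.dot(w)
    opt.zero_grad()
    loss.backward()          # squared hinge is differentiable, so plain SGD works
    opt.step()

acc = ((X @ w + b).sign() == y).float().mean().item()
print(f"train accuracy: {acc:.2f}")
```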

Compared with softmax+cross entropy, squared regularized hinge loss has better convergence and better sparsity.
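As a quick illustration of the sparsity point (my sketch, again assuming PyTorch): for a sample that is already beyond the margin, the squared hinge gradient is exactly zero, whereas softmax+cross entropy still produces a small nonzero gradient.

```python
import torch
import torch.nn.functional as F

# One confident, correctly classified sample (margin y*f(x) = 3 > 1).
score = torch.tensor([3.0], requires_grad=True)
y = 1.0
sq_hinge = torch.clamp(1 - y * score, min=0).pow(2).sum()
sq_hinge.backward()
print(score.grad)      # tensor([0.]) -- beyond the margin, the sample stops contributing

# The same confident sample under softmax + cross entropy (two-class logits, class 0 correct).
logits = torch.tensor([3.0, -3.0], requires_grad=True)
ce = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
ce.backward()
print(logits.grad)     # small but nonzero -- cross entropy never stops penalizing
```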

Why is softmax+cross entropy so dominant in neural networks?
Why not use squared regularized hinge loss for CNNs?

Best Answer

This guy does an excellent job of working through the math and explanations from intuition and first principles. Take a peek.

tl;dr
Hinge loss stops penalizing errors once the result is "good enough," while cross entropy keeps penalizing as long as the label and predicted distributions are not identical. Choosing cross entropy means we are aiming at the asymptote of perfection, not the threshold of "good enough."
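In symbols (my notation, not from the linked answer): the squared hinge loss is exactly zero once the margin y·f(x) reaches 1, while cross entropy only approaches zero as the softmax output approaches the one-hot label.

```latex
% Squared hinge loss vs. cross entropy (notation assumed, not from the answer)
\[
L_{\mathrm{hinge}^2}\bigl(y, f(x)\bigr) = \max\bigl(0,\; 1 - y\,f(x)\bigr)^2,
\qquad
L_{\mathrm{CE}}(y, p) = -\sum_{k} y_k \log p_k .
\]
\[
y\,f(x) \ge 1 \;\Longrightarrow\; L_{\mathrm{hinge}^2} = 0,
\qquad
L_{\mathrm{CE}} > 0 \text{ whenever } p \neq y,
\]
% and a softmax output never equals a one-hot vector exactly,
% so cross entropy keeps pushing toward the "asymptote of perfection."
```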