Solved – the history of the “cross entropy” as a loss function for neural networks

Tags: cross-entropy, loss-functions, neural-networks, references, supervised-learning

There seems to be a gap in the literature as to why cross-entropy is used.

Older references on neural networks ("ANNs") always use the squared loss. For example, here is an excerpt from Chong and Zak, "An Introduction to Optimization", 4th ed.:

[excerpt from Chong and Zak showing the squared-error objective]

Here is another from Simon Haykin's "Kalman Filtering and Neural Networks":

[excerpt from Haykin showing the squared-error objective]
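For comparison (this formula is my own addition, not taken from either excerpt), the squared-error loss over $N$ training pairs is commonly written as

$$
SE(y, \hat y) = \frac{1}{2}\sum\limits_{n = 1}^N \lVert y_n - \hat y_n \rVert^2,
$$

where $\hat y_n$ is the network output for the $n$th input and $y_n$ the corresponding target; the exact scaling and notation vary between texts.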


Somewhere along the way, cross-entropy became the dominant loss function, used in many papers and in almost all "blog"-type references on neural networks. Recall that the cross-entropy is often formulated as

$$
CE(y, \hat y) = -\sum\limits_{n = 1}^N \sum\limits_{c = 1}^C y_n^c \cdot\log(\hat y_n^c)
$$

where $n$ indexes the $N$ data points, $c$ indexes the $C$ classes, and $y$ and $\hat y$ denote the targets and the predictions, respectively.
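As a concrete illustration (a minimal sketch of the formula above, not part of the question; the names `cross_entropy`, `y_true`, and `y_pred` are my own), the double sum can be computed for one-hot targets and predicted class probabilities like this:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Multiclass cross-entropy for one-hot targets y_true (N x C)
    and predicted class probabilities y_pred (N x C, rows sum to 1)."""
    y_pred = np.clip(y_pred, eps, 1.0)       # guard against log(0)
    return -np.sum(y_true * np.log(y_pred))  # sum over samples n and classes c

# toy example with N = 2 samples and C = 3 classes
y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])
print(cross_entropy(y_true, y_pred))  # -log(0.7) - log(0.6) ~= 0.867
```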

Where did the above function even come from (books/papers)? Was there some famous work that used cross-entropy and popularized it over the squared loss? Is there a good reason to use CE as opposed to the squared loss (or the softmax loss associated with softmax/multiclass logistic regression)?

Best Answer

If you agree that logistic regression is a special case of a neural network, then the answer is D. R. Cox, who invented logistic regression.[1]

A neural network with zero hidden layers and a single sigmoid output, trained to maximize the binomial likelihood (equivalently, to minimize the cross-entropy), is exactly logistic regression.
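To make that concrete, here is a minimal sketch (my own illustration, not part of the original answer) of such a zero-hidden-layer network: a sigmoid applied to an affine function of the inputs, which is exactly the logistic regression model. The weights, bias, and data below are hypothetical stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical data: N = 5 samples with d = 3 features and binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])

# the "network": no hidden layers, one sigmoid output unit
w = rng.normal(size=3)   # weights (would be learned by training)
b = 0.0                  # bias
p = sigmoid(X @ w + b)   # predicted P(y = 1 | x) -- the logistic regression model

# the training criterion: binary cross-entropy,
# i.e. the negative Bernoulli log-likelihood of the labels
bce = -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
print(bce)
```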

Minimizing a binomial cross-entropy is equivalent to maximizing a particular likelihood; see the relationship between maximizing the likelihood and minimizing the cross-entropy.
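For completeness, the algebra behind that equivalence is one line: for labels $y_n \in \{0, 1\}$ and predicted probabilities $\hat y_n = P(y_n = 1 \mid x_n)$, the Bernoulli likelihood and its negative logarithm are

$$
L(\hat y) = \prod\limits_{n = 1}^N \hat y_n^{\,y_n}(1 - \hat y_n)^{1 - y_n},
\qquad
-\log L(\hat y) = -\sum\limits_{n = 1}^N \left[ y_n \log \hat y_n + (1 - y_n)\log(1 - \hat y_n) \right],
$$

and the right-hand side is exactly the binomial cross-entropy, so maximizing the likelihood and minimizing the cross-entropy yield the same solution.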

[1] D. R. Cox. "The Regression Analysis of Binary Sequences" Journal of the Royal Statistical Society. Series B (Methodological) Vol. 20, No. 2 (1958), pp. 215-242.