Solved – the history of the “cross entropy” as a loss function for neural networks

Tags: cross-entropy, loss-functions, neural-networks, references, supervised-learning

There seems to be a gap in the literature as to why cross-entropy is used.

Older references on neural networks ("ANNs") always use the squared loss. For example, here is an excerpt from Chong and Zak, "An Introduction to Optimization", 4th ed.:

[excerpt from Chong and Zak showing the squared-error objective]

Here is another from Simon Haykin's "Kalman Filtering and Neural Networks":

[excerpt from Haykin showing the squared-error objective]
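For comparison (this formula is my own addition, not taken from either excerpt), the squared-error loss over $N$ training pairs is commonly written as

$$
SE(y, \hat y) = \frac{1}{2}\sum\limits_{n = 1}^N \lVert y_n - \hat y_n \rVert^2,
$$

where $\hat y_n$ is the network output for the $n$th input and $y_n$ the corresponding target; the exact scaling and notation vary between texts.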


Somewhere along the way, cross-entropy became the dominant loss function, used in many papers and in almost all "blog"-type references on neural networks. Recall that the cross-entropy is often formulated as

$$
CE(y, \hat y) = -\sum\limits_{n = 1}^N \sum\limits_{c = 1}^C y_n^c \cdot\log(\hat y_n^c)
$$

where $n$ indexes the $N$ data points, $c$ indexes the $C$ classes, and $y$ and $\hat y$ denote the targets and the predictions, respectively.
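As a concrete illustration (a minimal sketch of the formula above, not part of the question; the names `cross_entropy`, `y_true`, and `y_pred` are my own), the double sum can be computed for one-hot targets and predicted class probabilities like this:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Multiclass cross-entropy for one-hot targets y_true (N x C)
    and predicted class probabilities y_pred (N x C, rows sum to 1)."""
    y_pred = np.clip(y_pred, eps, 1.0)       # guard against log(0)
    return -np.sum(y_true * np.log(y_pred))  # sum over samples n and classes c

# toy example with N = 2 samples and C = 3 classes
y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])
print(cross_entropy(y_true, y_pred))  # -log(0.7) - log(0.6) ~= 0.867
```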

Where did the above function even come from (books/papers)? Was there some famous work that used cross-entropy and popularized it over the squared loss? Is there a good reason to use CE as opposed to the squared loss (or the softmax loss associated with softmax/multiclass logistic regression)?

Best Answer

If you agree that logistic regression is a special case of a neural network, then the answer is D. R. Cox, who invented logistic regression.[1]

A neural network with zero hidden layers and a single sigmoid output, trained to maximize the binomial likelihood (equivalently, to minimize the cross-entropy), is exactly logistic regression.
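To make that concrete, here is a minimal sketch (my own illustration, not part of the original answer) of such a zero-hidden-layer network: a sigmoid applied to an affine function of the inputs, which is exactly the logistic regression model. The weights, bias, and data below are hypothetical stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical data: N = 5 samples with d = 3 features and binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])

# the "network": no hidden layers, one sigmoid output unit
w = rng.normal(size=3)   # weights (would be learned by training)
b = 0.0                  # bias
p = sigmoid(X @ w + b)   # predicted P(y = 1 | x) -- the logistic regression model

# the training criterion: binary cross-entropy,
# i.e. the negative Bernoulli log-likelihood of the labels
bce = -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
print(bce)
```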

Minimizing a binomial cross-entropy is equivalent to maximizing a particular likelihood; see the relationship between maximizing the likelihood and minimizing the cross-entropy.
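For completeness, the algebra behind that equivalence is one line: for labels $y_n \in \{0, 1\}$ and predicted probabilities $\hat y_n = P(y_n = 1 \mid x_n)$, the Bernoulli likelihood and its negative logarithm are

$$
L(\hat y) = \prod\limits_{n = 1}^N \hat y_n^{\,y_n}(1 - \hat y_n)^{1 - y_n},
\qquad
-\log L(\hat y) = -\sum\limits_{n = 1}^N \left[ y_n \log \hat y_n + (1 - y_n)\log(1 - \hat y_n) \right],
$$

and the right-hand side is exactly the binomial cross-entropy, so maximizing the likelihood and minimizing the cross-entropy yield the same solution.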

[1] D. R. Cox. "The Regression Analysis of Binary Sequences" Journal of the Royal Statistical Society. Series B (Methodological) Vol. 20, No. 2 (1958), pp. 215-242.