In this paper (http://quinonero.net/Publications/predicting-clicks-facebook.pdf), the authors introduce a metric called Normalized Cross Entropy (NE):
$$
\text{NE} = \frac{-\frac{1}{N} \sum_{i=1}^{N}\bigl(y_i\log(p_i) + (1-y_i)\log(1-p_i)\bigr)}{-\bigl(p\log(p) + (1-p)\log(1-p)\bigr)}
$$
where $p_i$ is the model's estimate of $P(y_i=1)$ and $p=\frac{1}{N}\sum_i y_i$ is the empirical base rate (the "average" probability) over the training set. Note that, unlike the paper, I've assumed $y_i \in \{0,1\}$ here so that the numerator takes the more familiar form of binary cross entropy.
The authors justify the normalization, i.e. dividing the cross entropy in the numerator by the cross entropy of a baseline model that predicts $p$ for every example, by noting that the closer $p$ is to 0 or 1, the easier it is to achieve a small log loss (i.e. cross entropy, i.e. the numerator). Can someone explain why this is true?
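For concreteness, here is a minimal Python sketch (function names are my own, not from the paper) that computes the raw average log loss and the normalized version. It illustrates the effect being asked about: a constant predictor that always outputs the base rate gets a smaller raw log loss as $p$ moves toward 0 or 1, even though it conveys no information, while NE stays at 1.

```python
import numpy as np

def avg_log_loss(y, p_hat, eps=1e-12):
    """Average binary cross entropy: the numerator of NE."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def normalized_entropy(y, p_hat):
    """NE: average log loss divided by the entropy of the base rate p."""
    p = np.mean(y)
    baseline = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return avg_log_loss(y, p_hat) / baseline

rng = np.random.default_rng(0)
for p in [0.5, 0.1, 0.01]:
    y = rng.binomial(1, p, size=100_000)
    # A trivial "model" that always predicts the empirical base rate.
    p_hat = np.full_like(y, y.mean(), dtype=float)
    print(f"p={p:0.2f}  raw log loss={avg_log_loss(y, p_hat):.4f}  "
          f"NE={normalized_entropy(y, p_hat):.4f}")
# The raw log loss drops as p moves away from 0.5, while NE stays close to 1.
```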
Best Answer
Personally I like using $1 - \text{LLH}/\text{Entropy}$ (so 1 minus their metric), which can be interpreted as the "proportion of entropy explained by the model".
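A minimal sketch of that reading (variable and function names are my own): compute NE as defined in the question and report $1 - \text{NE}$, which is roughly 0 for the constant base-rate predictor and approaches 1 for a perfect model.

```python
import numpy as np

def entropy_explained(y, p_hat, eps=1e-12):
    """1 - LLH/Entropy: share of the base-rate entropy explained by the model."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    llh = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    p = np.mean(y)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return 1 - llh / entropy

y = np.array([0, 0, 0, 1, 0, 1, 0, 0])
print(entropy_explained(y, np.full(len(y), y.mean())))  # ~0: baseline explains nothing
print(entropy_explained(y, y.astype(float)))            # ~1: perfect predictions
```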