In this paper (http://quinonero.net/Publications/predicting-clicks-facebook.pdf), the authors introduce a metric called Normalized Cross Entropy (NE):
$$
\text{NE} = \frac{-\frac{1}{N} \sum_{i=1}^{N}\bigl(y_i\log(p_i) + (1-y_i)\log(1-p_i)\bigr)}{-\bigl(p\log(p) + (1-p)\log(1-p)\bigr)}
$$
where $p_i$ is the model's estimate of $P(y_i=1)$ and $p=\frac{1}{N}\sum_i y_i$ is the empirical base rate (the "average" probability) over the training set. Note that, unlike the paper, I've assumed $y_i \in \{0,1\}$ here so that the numerator takes the more familiar form of binary cross entropy.
The authors justify the normalization, i.e. dividing the cross entropy in the numerator by the cross entropy of a baseline model that predicts $p$ for every example, by noting that the closer $p$ is to 0 or 1, the easier it is to achieve a small log loss (i.e. cross entropy, i.e. the numerator). Can someone explain why this is true?
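For concreteness, here is a minimal Python sketch (function names are my own, not from the paper) that computes the raw average log loss and the normalized version. It illustrates the effect being asked about: a constant predictor that always outputs the base rate gets a smaller raw log loss as $p$ moves toward 0 or 1, even though it conveys no information, while NE stays at 1.

```python
import numpy as np

def avg_log_loss(y, p_hat, eps=1e-12):
    """Average binary cross entropy: the numerator of NE."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def normalized_entropy(y, p_hat):
    """NE: average log loss divided by the entropy of the base rate p."""
    p = np.mean(y)
    baseline = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return avg_log_loss(y, p_hat) / baseline

rng = np.random.default_rng(0)
for p in [0.5, 0.1, 0.01]:
    y = rng.binomial(1, p, size=100_000)
    # A trivial "model" that always predicts the empirical base rate.
    p_hat = np.full_like(y, y.mean(), dtype=float)
    print(f"p={p:0.2f}  raw log loss={avg_log_loss(y, p_hat):.4f}  "
          f"NE={normalized_entropy(y, p_hat):.4f}")
# The raw log loss drops as p moves away from 0.5, while NE stays close to 1.
```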
Best Answer
Personally I like using $1 - \text{LLH}/\text{Entropy}$ (so 1 minus their metric), which can be interpreted as the "proportion of entropy explained by the model".
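A minimal sketch of that reading (variable and function names are my own): compute NE as defined in the question and report $1 - \text{NE}$, which is roughly 0 for the constant base-rate predictor and approaches 1 for a perfect model.

```python
import numpy as np

def entropy_explained(y, p_hat, eps=1e-12):
    """1 - LLH/Entropy: share of the base-rate entropy explained by the model."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    llh = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    p = np.mean(y)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return 1 - llh / entropy

y = np.array([0, 0, 0, 1, 0, 1, 0, 0])
print(entropy_explained(y, np.full(len(y), y.mean())))  # ~0: baseline explains nothing
print(entropy_explained(y, y.astype(float)))            # ~1: perfect predictions
```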