Solved – Why is cross entropy not a common evaluation metric for model performance

classification, cross-entropy, model-evaluation

When we train a classifier, we use cross entropy as a loss function and, for example, an F-Score as an evaluation metric, but why?

Why not use cross entropy on the test set to evaluate the model performance?

Especially in a scenario where we care about the confidence of the model, it would give us a nice metric. Yet, I can't remember seeing a single paper using this. So I must be missing something.

Best Answer

I always use (test) cross-entropy under cross-validation to assess the performance of a classification model. It's far more robust than accuracy on small datasets (because accuracy isn't "smooth"), and far more meaningful than accuracy (although perhaps not than precision and recall) when classes are imbalanced.
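
As a rough illustration of that workflow, here is a minimal sketch using scikit-learn (the synthetic dataset and logistic-regression model are just placeholders). scikit-learn exposes cross-entropy as the `"neg_log_loss"` scorer, negated so that higher is better:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data and model, purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Cross-validated cross-entropy (log loss); scorer is negated, so flip the sign back
scores = cross_val_score(clf, X, y, cv=5, scoring="neg_log_loss")
print("cross-validated cross-entropy: %.3f (+/- %.3f)" % (-scores.mean(), scores.std()))
```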

However, the problem with cross-entropy is that it doesn't live on any objective scale; it's a very relative metric. You can compare the performance of XGBoost vs. a neural network on a given data set, and the one with the lower cross-entropy (or higher test log-likelihood) is the better model. But saying "XGBoost gets a cross-entropy of X on problem A and a cross-entropy of Y on problem B" is much harder to interpret.
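
A minimal sketch of that same-problem comparison (here `HistGradientBoostingClassifier` and `MLPClassifier` stand in for XGBoost and a neural network, and the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "gradient boosting": HistGradientBoostingClassifier(random_state=0),
    "neural network": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Test cross-entropy on the same held-out set; lower is better
    ce = log_loss(y_te, model.predict_proba(X_te))
    print(f"{name}: test cross-entropy = {ce:.3f}")
```

The comparison is meaningful because both models see the same data and the same test set; quoting either number in isolation, or across different problems, is not.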

In general, from an information-theoretic point of view, binary classification with balanced classes is a "harder" problem than binary classification with a 90/10 class imbalance, because the class prior gives you less information to start with: the label entropy is higher (compare $-(0.1\ln 0.1 + 0.9\ln 0.9) \approx 0.33$ with $-2\cdot 0.5\ln 0.5 = \ln 2 \approx 0.69$). If you're trying to gauge how well your classifier performs on two different problems with different class balances, you face two competing effects: one problem's features may contain more information about the target variable, while the other problem is simply easier to solve.
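
A quick check of those label entropies (in nats), just to make the numbers concrete:

```python
import numpy as np

def label_entropy(p):
    # Entropy of a class prior: -sum p_i * ln(p_i)
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

print(label_entropy([0.5, 0.5]))  # ln 2 ≈ 0.693: balanced labels, maximal uncertainty
print(label_entropy([0.9, 0.1]))  # ≈ 0.325: the 90/10 prior already tells you a lot
```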

For that reason, you wouldn't see an academic paper (I hope, anyway) that says "we used a neural network to approach this problem for the first time and got a cross-entropy of X". It would, however, be legitimate to say "people usually use neural networks to approach this problem and get a cross-entropy of X, but we used XGBoost and got a cross-entropy of Y", because then you're comparing two classifiers on the same problem.
