There is an answer on the Kaggle question board here by Dr. Fuzzy:
You can assess a total misclassification scenario by plugging zero probabilities into the log-loss
function (here sklearn's log-loss):
LL            Count    Class
3.31035117    15294    toxic
0.34523409     1595    severe_toxic
1.82876664     8449    obscene
0.10346200      478    threat
1.70495856     7877    insult
0.30410902     1405    identity_hate
For some classes the possible LL for total misclassification is really low.
In this range gradients might no longer provide meaningful directions.
Another point is that most log-loss implementations use clipping for probs near 0.
This will play in here as well.
I understand this person is saying "if you never predict the class 'toxic', then what is the log loss?" And the answer is 3.31035117. My question is: how can you possibly get a non-infinite answer?
As far as I know, sklearn's log_loss function is binary cross-entropy.
The binary crossentropy function is (for a single label):
$$-\bigl(y\log(p) + (1-y)\log(1-p)\bigr)$$
If the label is "toxic" ($y=1$), but we associate that with probability 0 ($p=0$), we should get:
$$-\bigl(1\cdot\log(0) + (1-1)\log(1-0)\bigr) = -(-\infty + 0) = \infty$$
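A quick numerical check of that arithmetic, as a minimal sketch in plain NumPy (not sklearn). The helper name `binary_cross_entropy` is made up for illustration, and it assumes that a term with a zero coefficient contributes exactly 0:

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Unclipped binary cross-entropy for one label y (0 or 1) and probability p."""
    # Silence the divide-by-zero warning that np.log(0.0) would otherwise emit.
    with np.errstate(divide="ignore"):
        # Evaluate only the term whose coefficient is non-zero, so 0 * log(0) counts as 0.
        pos = y * np.log(p) if y != 0 else 0.0
        neg = (1 - y) * np.log(1 - p) if y != 1 else 0.0
    return -(pos + neg)

# Label "toxic" (y = 1) predicted with probability exactly 0.
loss = binary_cross_entropy(1, 0.0)
print(loss)  # inf
```

So the raw formula really does give infinity for this case.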
Why are we not getting infinity here?
Best Answer
As Dr. Fuzzy points out, sklearn's log-loss uses clipping. This means it isn't putting a 0 in for $p$, but rather some small epsilon value. The clipping is there to avoid the infinities/weirdness at probabilities of exactly 0 or 1 that you noted.
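
To see how clipping leads to finite numbers like the ones in the quoted table, here is a minimal sketch in plain NumPy rather than sklearn itself. It rests on two assumptions that are not stated in the thread: that the clipping value is `1e-15` (for a long time the default `eps` of sklearn's `log_loss`; newer releases may clip differently), and that the reported loss is the mean over all rows of the data set. The names `EPS`, `clipped_log_loss` and `implied_n` are made up for illustration.

```python
import numpy as np

EPS = 1e-15  # assumed clipping value, not stated in the thread

def clipped_log_loss(y_true, y_pred, eps=EPS):
    """Mean binary cross-entropy with predictions clipped to [eps, 1 - eps]."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

# A single positive label predicted with probability 0 costs -log(eps), not infinity.
single = clipped_log_loss([1], [0.0])
print(single)  # about 34.54

# (LL, Count) pairs copied from the quoted table.
table = {
    "toxic": (3.31035117, 15294),
    "severe_toxic": (0.34523409, 1595),
    "obscene": (1.82876664, 8449),
    "threat": (0.10346200, 478),
    "insult": (1.70495856, 7877),
    "identity_hate": (0.30410902, 1405),
}

# If every positive costs -log(eps) and every negative costs about 0, then
# LL = Count * -log(eps) / N, so each row of the table implies a data set size N.
implied_n = {name: -np.log(EPS) * count / ll for name, (ll, count) in table.items()}
for name, n in implied_n.items():
    print(f"{name:>14}: N = {n:.1f}")
```

Under these assumptions all six rows imply the same N, roughly 159,571, which is consistent with each number in the table being (number of positives) × 34.54 averaged over the whole data set. That also explains why the rare classes have such a low "total misclassification" loss: they simply have few positives to be penalised for.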