Solved – Binary cross-entropy: plugging in probability 0


There is an answer on the Kaggle question board here by Dr. Fuzzy:

You can assess a total misclassification scenario by plugging zero probabilities into the log-loss function (here sklearn's log loss):

LL           Count    Class
3.31035117   15294    toxic
0.34523409    1595    severe_toxic
1.82876664    8449    obscene
0.10346200     478    threat
1.70495856    7877    insult 
0.30410902    1405    identity_hate

For some classes the possible LL for total misclassification is really low. In this range, gradients might no longer provide meaningful directions. Another point is that most log-loss implementations clip probabilities near 0; that comes into play here as well.
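
For reference, here is a rough sketch of the scoring I think is being described, on a made-up label vector rather than the actual competition data. I clip to 1e-15 explicitly, since that was the default eps in older versions of sklearn's log_loss and the internal handling has changed across releases:

```python
# Score an "always predict 0" model with sklearn's log_loss.
# Hypothetical labels: 10 positives out of 100 samples (not the Kaggle data).
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1] * 10 + [0] * 90)        # hypothetical class labels
p_zero = np.zeros_like(y_true, dtype=float)    # total misclassification: p = 0 everywhere

eps = 1e-15
p_clipped = np.clip(p_zero, eps, 1 - eps)      # avoid log(0); mirrors older sklearn defaults

print(log_loss(y_true, p_clipped))             # finite: roughly 0.1 * 34.54 ≈ 3.45
```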

I understand this person is saying "if you never predict the class 'toxic', then what is the log loss?" And the answer is 3.31035117. My question is: how can you possibly get a non-infinite answer?

As far as I know, sklearn's log_loss function is binary cross-entropy.

The binary cross-entropy function is (for a single label):

$$ -\big( y\log(p) + (1-y)\log(1-p) \big) $$

If the label is "toxic" ($y=1$), but we associate that with probability 0 ($p=0$), we should get:

$$ -\big( 1\cdot\log(0) + (1-1)\log(1-0) \big) = -(-\infty + 0) = \infty $$
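
A quick NumPy check of that arithmetic (just the single-label formula, nothing Kaggle-specific) agrees:

```python
# With y = 1 and p = 0, the unclipped binary cross-entropy is infinite:
# np.log(0) is -inf, and the (1 - y) term is 0 * log(1) = 0.
import numpy as np

y, p = 1.0, 0.0
with np.errstate(divide="ignore"):
    bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(bce)  # inf
```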

Why are we not getting infinity here?

Best Answer

As Dr. Fuzzy points out, sklearn's log_loss uses clipping. This means it isn't plugging an exact 0 in for $p$, but rather some small epsilon value. The clipping is there precisely to avoid the infinities at probabilities of 0 and 1 that you noted.
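
To make the effect of clipping concrete, here is a short sketch. The eps = 1e-15 value matches the default in older versions of sklearn.metrics.log_loss (newer releases handle eps differently), and the dataset size used at the end is my own assumption, not something stated in the thread:

```python
# What clipping does to the worked example above.
import numpy as np

eps = 1e-15
p = np.clip(0.0, eps, 1 - eps)      # the predicted 0 becomes 1e-15
loss_per_positive = -np.log(p)      # ≈ 34.54 per positive label, instead of inf
print(loss_per_positive)

# Each negative label contributes only -log(1 - eps) ≈ 1e-15, so averaging
# over a dataset where a fraction f of the rows are positive gives roughly
# f * 34.54. If the 15294 "toxic" rows in the table are out of roughly
# 160k samples (my assumption; the thread does not state the total), then
f = 15294 / 160_000
print(f * loss_per_positive)        # ≈ 3.30, close to the 3.31 in the table
```

In other words, the per-class values in the table are just the fraction of positive labels times $-\log(\text{eps})$, which is large but finite.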