Solved – Cross-entropy for probabilistic classes

cross-entropy, probability

I can understand why cross-entropy, $H_e = -\sum_i P_{true}(c_i)\,\log{P_{predicted}(c_i)}$, works as a cost function when we classify a sample as belonging to one of the classes $c_1, \dots, c_n$:

  • $P_{true}(c^{true}) = 1$, and the true probability of every other class is zero;
  • so $H_e = -\log{P_{predicted}(c^{true})}$, because all the other terms vanish.

Now, if you predict a probability close to $1.0$ for the true class, the cross-entropy approaches zero. If you predict a small probability for the true class, then the logarithm penalizes you heavily.
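A minimal sketch of this one-hot case (assuming NumPy; the helper name `one_hot_cross_entropy` is just for illustration):

```python
import numpy as np

def one_hot_cross_entropy(predicted, true_class):
    """Cross-entropy when the true distribution is one-hot:
    only the predicted probability of the true class contributes."""
    return -np.log(predicted[true_class])

predicted = np.array([0.05, 0.90, 0.05])
print(one_hot_cross_entropy(predicted, true_class=1))  # ~0.105: confident and correct
print(one_hot_cross_entropy(predicted, true_class=0))  # ~3.00: heavily penalized by the log
```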

But what if you don't know the true label? All you know is that:

  • $P_{true}(c_1) = 0.3$
  • $P_{true}(c_2) = 0.25$
  • $P_{true}(c_3) = 0.45$

Now if:

  • $P_{pred}(c_1) = 0.3$
  • $P_{pred}(c_2) = 0.25$
  • $P_{pred}(c_3) = 0.45$

and we compute cross-entropy:

  • $H_e = -(0.3 \log{0.3} + 0.25 \log{0.25} + 0.45 \log{0.45}) \approx 1.067$

for a prediction that matches the true probabilities exactly. Why is that so?
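For reference, the same computation as a minimal NumPy sketch (the helper name `cross_entropy` is just illustrative):

```python
import numpy as np

def cross_entropy(p_true, p_pred):
    """General cross-entropy H(p, q) = -sum_i p_i * log(q_i), in nats."""
    return -np.sum(p_true * np.log(p_pred))

p_true = np.array([0.30, 0.25, 0.45])
p_pred = np.array([0.30, 0.25, 0.45])  # the prediction matches the true distribution exactly
print(cross_entropy(p_true, p_pred))   # ~1.067, still not zero
```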

Best Answer

The fact that the cross entropy is more than zero just means that even if you have the probabilities correct, you still cannot predict the outcome of a particular event. If you were to predict the outcomes of a series of coin flips (assuming the coin is fair), you would be correct to predict that the outcomes will be 50% heads, 50% tails. But you wouldn't be able to predict the outcome of any particular flip, so the cross entropy will be greater than zero.

Getting the probabilities correct will minimize the cross entropy. If you predict that you will get heads 75% of the time, you will find that the cross entropy is greater than the cross entropy using the true probabilities.

$H_{wrong} = -(.5 \log .25 + .5 \log .75) \approx .837$

$H_{correct} = -(.5 \log .5 + .5 \log .5) \approx .693$
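The same numbers in a short sketch (NumPy only; the sweep over candidate predictions is just to show that $0.5$ is the minimizer):

```python
import numpy as np

def coin_cross_entropy(q_heads, p_heads=0.5):
    """Cross-entropy (in nats) of a coin with true P(heads) = p_heads
    against a model that predicts P(heads) = q_heads."""
    return -(p_heads * np.log(q_heads) + (1 - p_heads) * np.log(1 - q_heads))

print(coin_cross_entropy(0.75))  # ~0.837, the wrong probabilities
print(coin_cross_entropy(0.50))  # ~0.693, the true probabilities

# Sweep the predicted heads probability: the minimum sits at the true value.
qs = np.linspace(0.01, 0.99, 99)
print(qs[np.argmin(coin_cross_entropy(qs))])  # ~0.5
```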

The correct probabilities form a lower bound for the cross entropy, but this lower bound is not necessarily zero: it is zero if and only if the process is deterministic (i.e., the true probability of one of the classes is one). In general the bound equals the entropy of the true distribution, which for your three-class example is exactly the $\approx 1.067$ you computed.

Remember that the cross entropy represents the number of bits needed to represent an event drawn from one distribution when it's encoded using a scheme optimized for another distribution. In the case that your process is deterministic, the optimum encoding scheme needs no bits at all --- you already know the outcome beforehand. But if you have a random process like flipping a coin, you will need some bits to communicate heads or tails no matter what.
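The same point in bits (log base 2), as a minimal sketch contrasting a fair coin, a deterministic process, and the three-class example above:

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits: the average number of bits an optimal code needs
    per outcome drawn from p (zero-probability terms contribute nothing)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

print(entropy_bits([0.5, 0.5]))         # 1.0 bit per fair coin flip
print(entropy_bits([1.0, 0.0]))         # 0.0 bits: the outcome is known in advance
print(entropy_bits([0.3, 0.25, 0.45]))  # ~1.54 bits for the three-class example
```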