Solved – Deriving binary cross entropy loss function

cross-entropy, loss-functions, neural-networks

I want to know how the equation for binary cross entropy came about. My approach is the following:

Let's say we have two ground truths: $y_1$ and $y_2$. We also have two predictions $p_1$ and $p_2$. Now, $p_2$ can also be defined as $1 - p_1$, since we're dealing with a binary problem.

From this, how exactly do we arrive at this equation:
$$-(y\log p + (1-y)\log(1-p))$$

And if we think of this as a loss function, why does it make sense to minimize it?

Best Answer

Suppose there's a random variable $Y$ with $Y \in \{0,1\}$ (binary classification). The Bernoulli probability model then gives us the likelihood:

$$ L(p) = p^y (1-p)^{1-y} $$
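
To see why this single expression covers both outcomes, plug in the two possible values of $y$:

$$ L(p) = \begin{cases} p, & y = 1 \\ 1-p, & y = 0 \end{cases} $$

so the likelihood is simply the probability the model assigned to the observed class.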

$$ \log L(p) = y\log p + (1-y) \log (1-p) $$
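
For $n$ independent observations $(y_i, p_i)$, the log-likelihoods add, so negating and averaging gives the binary cross entropy loss in the batch form commonly used to train neural networks:

$$ -\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i\log p_i + (1-y_i)\log(1-p_i)\bigr) $$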

It's often easier to work with derivatives when the objective is expressed in logs, and because the logarithm is monotonic, the maximizer of the log-likelihood is the same as the maximizer of the likelihood. A cost or loss function is conventionally defined so that larger values (further from 0) mean worse model performance. The negative sign simply preserves that convention and makes the quantity easier to interpret: minimizing the negative log-likelihood leads to the same result as maximizing the log-likelihood above.
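
As a minimal numeric sketch of that equivalence (the helper names `bce_loss` and `bernoulli_likelihood` are made up for illustration, not from any library):

```python
import numpy as np

def bce_loss(y, p):
    """Binary cross entropy for one label y in {0, 1} and predicted probability p."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bernoulli_likelihood(y, p):
    """Bernoulli likelihood p^y * (1 - p)^(1 - y)."""
    return p**y * (1 - p)**(1 - y)

# A confident correct prediction: low loss, high likelihood.
print(bce_loss(1, 0.9), bernoulli_likelihood(1, 0.9))   # ~0.105, 0.9
# A confident wrong prediction: high loss, low likelihood.
print(bce_loss(1, 0.1), bernoulli_likelihood(1, 0.1))   # ~2.303, 0.1
# The loss is exactly the negative log of the likelihood,
# so minimizing one is the same as maximizing the other.
print(np.isclose(bce_loss(1, 0.9), -np.log(bernoulli_likelihood(1, 0.9))))  # True
```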