Cross-Entropy Loss – Choosing Between Categorical and Binary Cross-Entropy for Binary Predictions

cross-entropy, loss-functions, machine-learning, neural-networks, tensorflow

First of all, I realized that if I need to perform binary predictions, I have to create at least two classes by one-hot encoding the labels. Is this correct? And is binary cross-entropy only for predictions with a single output class? If I were to use the categorical cross-entropy loss that is typically found in most libraries (like TensorFlow), would there be a significant difference?

In fact, what are the exact differences between categorical and binary cross-entropy? I have never seen an implementation of binary cross-entropy in TensorFlow, so I thought perhaps the categorical one works just as well.

Best Answer

Bernoulli$^*$ cross-entropy loss is a special case of categorical cross-entropy loss for $m=2$ classes.

$$ \begin{align} \mathcal{L}(\theta) &= -\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^m y_{ij}\log(p_{ij}) \\ &= -\frac{1}{n}\sum_{i=1}^n \left[y_i \log(p_i) + (1-y_i) \log(1-p_i)\right] \end{align} $$

where $i$ indexes samples/observations, $j$ indexes classes, and $m$ is the number of classes; $y$ is the sample label (a one-hot vector over the $m$ classes in the first line, a binary scalar in the second) and $p_{ij}\in(0,1)$ with $\sum_{j} p_{ij} = 1 \;\forall i$ is the predicted probability for a sample. For $m=2$, writing $y_i := y_{i2}$ and $p_i := p_{i2}$, the constraint forces $y_{i1} = 1-y_i$ and $p_{i1} = 1-p_i$, so the inner sum over $j$ collapses to the second line.
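
As a quick sanity check of this equivalence (and of the TensorFlow part of the question), here is a minimal sketch with made-up labels and predicted probabilities. It evaluates both lines of the formula directly in NumPy and then compares them against the stock Keras losses `tf.keras.losses.CategoricalCrossentropy` and `tf.keras.losses.BinaryCrossentropy` (both expect probabilities by default); the data are purely illustrative.

```python
import numpy as np
import tensorflow as tf

# Made-up binary labels and predicted probabilities for the positive class.
y = np.array([1, 0, 1, 1, 0], dtype=np.float32)            # binary encoding y_i
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1], dtype=np.float32)  # p_i = P(class 1)

# The same data as one-hot labels / two-column probabilities (rows sum to 1).
y_onehot = np.stack([1 - y, y], axis=1)   # y_ij, shape (n, 2)
p_twocol = np.stack([1 - p, p], axis=1)   # p_ij, shape (n, 2)

# Direct implementation of the two lines of the loss above.
categorical = -np.mean(np.sum(y_onehot * np.log(p_twocol), axis=1))
bernoulli = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(categorical, bernoulli)  # equal up to floating-point error

# The built-in Keras losses agree as well (up to their internal clipping).
cce = tf.keras.losses.CategoricalCrossentropy()
bce = tf.keras.losses.BinaryCrossentropy()
print(float(cce(y_onehot, p_twocol)), float(bce(y[:, None], p[:, None])))
```

So for two classes it makes no practical difference which loss you pick; the usual reason to prefer the binary/Bernoulli form is that the network then needs only a single sigmoid output rather than a two-way softmax.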


$^*$ I write "Bernoulli cross-entropy" because this loss arises from a Bernoulli probability model; there is no "binary distribution." The name "binary cross-entropy" doesn't tell us whether the thing that is binary is the one-hot vector over $m \ge 2$ labels or whether the author is using a binary (success/failure) encoding for each trial. This isn't a general convention, but it makes clear that these formulae arise from particular probability models, whereas the conventional jargon is not clear in that way.
