Suppose we have two greyscale images which are flattened to 1d arrays: $y=(y_1, y_2, \ldots, y_n)$ and $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)$ with pixel values in $[0,1]$. How exactly do we use cross-entropy to compare these images?
The definition of cross entropy leads me to believe that we should compute $$-\sum_{i} y_i \log \hat{y}_i,$$ but in the machine learning context I usually see loss functions using "binary" cross entropy, which I believe is $$ -\sum_i y_i \log \hat{y}_i - \sum_i (1-y_i) \log (1-\hat{y}_i).$$
Can someone please clarify this for me?
Best Answer
The cross-entropy between a single label and prediction would be
$$L = -\sum_{c \in C} y_{c} \log \hat y_{c}$$
where $C$ is the set of all classes. This is the first expression in your post. However, we need to sum over all pixels in an image to apply this:
$$L = -\sum_{i \in I} \sum_{c \in C} y_{i,c} \log \hat y_{i,c}$$
where $I$ is the set of pixels in the image and $y_{i,c}$ is an indicator variable for whether the $i$th pixel belongs to class $c$.
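As a quick sketch of the pixel-wise sum above, here is the multiclass cross-entropy in NumPy; the function and variable names are illustrative, with `y` a one-hot array of shape `(n_pixels, n_classes)` and `y_hat` the predicted probabilities:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Sum of -y[i,c] * log(y_hat[i,c]) over all pixels i and classes c."""
    # Clip predictions away from 0 so that log() stays finite.
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.sum(y * np.log(y_hat))

# Toy example: two pixels, three classes.
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])      # one-hot labels y_{i,c}
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])  # predicted probabilities

# Only the terms where y_{i,c} = 1 survive: -(log 0.7 + log 0.8).
loss = cross_entropy(y, y_hat)
```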
In the binary case, we only have two classes: $0$ and $1$.
$$L = -\sum_{i \in I} \left( y_{i,0} \log \hat y_{i,0} + y_{i,1} \log \hat y_{i,1} \right)$$
Since $y_{i,0}$ and $y_{i,1}$ must sum to $1$ (and likewise $\hat y_{i,0}$ and $\hat y_{i,1}$), we can drop the class indices and write $y_i = y_{i,0}$, $1-y_i = y_{i,1}$, and similarly $\hat y_i = \hat y_{i,0}$.
$$L = -\sum_{i \in I} \left( y_i \log \hat y_i + (1-y_i) \log (1-\hat y_i) \right)$$
This is where the second equation in your post comes from.
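A quick numerical check (a sketch with illustrative data and names) that the explicit two-class sum and the collapsed binary form give the same loss:

```python
import numpy as np

# Per-pixel targets y_i and predictions y_hat_i in [0, 1]; arbitrary toy data.
y = np.array([1.0, 0.0, 1.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6, 0.99])

# Collapsed binary cross-entropy (the second equation in the question).
bce = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# The same loss with explicit per-class terms y_{i,0} = y_i, y_{i,1} = 1 - y_i.
y2 = np.stack([y, 1 - y], axis=1)              # shape (n_pixels, 2)
y_hat2 = np.stack([y_hat, 1 - y_hat], axis=1)
ce = -np.sum(y2 * np.log(y_hat2))

# bce and ce agree up to floating-point rounding.
```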