Binary vs Categorical Cross-Entropy – Clarifying Confusion in Python LSTM

Tags: cross-entropy, lstm, python

I have a dataset with 10 categorical input features and one categorical output feature with classes 0 and 1. X_train is a 3D array, so I did label encoding on the dataset beforehand.
I applied categorical_crossentropy and got 26% accuracy with a sigmoid activation function. When I switched to binary_crossentropy, the accuracy drastically increased to 98%.

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(256, input_shape=(n_timesteps, n_features), recurrent_activation='hard_sigmoid'))
model.add(Dense(16))
model.add(Dense(n_outputs, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

and the dataset is split as:

X_train:  (430000, 5, 10)
y_train:  (430000, 1)

Best Answer

There are circumstances when the two losses are equivalent, but those circumstances do not exist in OP's code.

In a comment, OP writes that they only have one output neuron.

With 1 output neuron and binary cross-entropy, the model outputs a single value $p$ and the loss for one example is computed as

$$ L_b = -y \log p - (1 - y) \log (1 - p), $$ which is the correct way to compute the loss.
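As a quick numerical check of this formula, here is a minimal NumPy sketch (not part of OP's code) that evaluates $L_b$ for a single example:

import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # Binary cross-entropy for one example with a single predicted probability p.
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

print(binary_cross_entropy(1, 0.9))  # ~0.105: confident and correct, small loss
print(binary_cross_entropy(0, 0.9))  # ~2.303: confident and wrong, large loss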

However, with 1 output neuron and categorical cross-entropy, the loss is computed as

$$ L_c = -y \log p $$

which is clearly different because it silently drops the $(1-y) \log(1-p)$ term (treats it as zero). This loss is bogus: it is minimized at $L_c = 0$ by predicting $p=1$ for every input, regardless of the true label, resulting in a totally useless model.
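To see the failure concretely, here is a hedged sketch of that degenerate loss: a negative example ($y = 0$) contributes zero loss no matter what the model predicts, so always outputting $p = 1$ looks "optimal" to the optimizer.

import numpy as np

def degenerate_categorical_ce(y, p, eps=1e-12):
    # Categorical cross-entropy collapsed onto a single output neuron:
    # only the -y * log(p) term survives.
    p = np.clip(p, eps, 1 - eps)
    return -y * np.log(p)

print(degenerate_categorical_ce(0, 0.99))  # 0.0: confidently wrong, yet no penalty
print(degenerate_categorical_ce(1, 0.99))  # ~0.01: pushing p toward 1 always reduces the loss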

To use categorical cross-entropy correctly, OP needs to make these changes:

  • use $k$ output neurons (one for each of the $k$ classes). In OP's particular case, $k=2$
  • these output neurons need to be a probability vector: the neurons sum to 1 for all inputs, and all values are non-negative. The standard way to do this is to use a softmax activation in the output layer.

After making these changes, categorical cross-entropy is computed correctly. The reason is that the model now outputs two probabilities $p_1, p_2$, so the loss becomes

$$ L_c = -y \log p_1 - (1 - y) \log p_2 $$

where $ 0 \le p_i \le 1$ and $p_1 + p_2 = 1$. Identifying $p_1 = p$ and $p_2 = 1 - p$, simple algebra shows that $L_c = L_b$, as desired.
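For completeness, here is a minimal sketch of a corrected setup, assuming standalone Keras (the imports and the to_categorical call are mine, not OP's; layer sizes follow OP's snippet):

from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.utils import to_categorical

# One-hot encode the 0/1 labels so y_train has shape (n_samples, 2)
y_train_onehot = to_categorical(y_train, num_classes=2)

model = Sequential()
model.add(LSTM(256, input_shape=(n_timesteps, n_features), recurrent_activation='hard_sigmoid'))
model.add(Dense(16))
model.add(Dense(2, activation='softmax'))  # k = 2 output neurons forming a probability vector
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Alternatively, sparse_categorical_crossentropy accepts the integer 0/1 labels directly and avoids the one-hot step; either way, the loss now matches binary_crossentropy up to the equivalence shown above.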