Binary vs Categorical Cross-Entropy – Clarifying Confusion in Python LSTM

Tags: cross-entropy, lstm, python

I have a dataset with 10 categorical input features and one categorical output feature with classes 0 and 1. X_train is a 3D array, so I did label encoding on the dataset beforehand.
I applied categorical_crossentropy and got 26% accuracy with a sigmoid activation function. When I switched to binary_crossentropy, the accuracy drastically increased to 98%.

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(256, input_shape=(n_timesteps, n_features), recurrent_activation='hard_sigmoid'))
model.add(Dense(16))
model.add(Dense(n_outputs, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

and the dataset is split as:

X_train:  (430000, 5, 10)
y_train:  (430000, 1)

Best Answer

There are circumstances when the two losses are equivalent, but those circumstances do not exist in OP's code.

In a comment, OP writes that they only have one output neuron.

With 1 output neuron and binary cross-entropy, the model outputs a single value $p$ and the loss for one example is computed as

$$ L_b = -y \log p - (1 - y) \log (1 - p), $$ which is the correct way to compute the loss.
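As a quick numerical check of this formula, here is a minimal NumPy sketch (not part of OP's code) that evaluates $L_b$ for a single example:

import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # Binary cross-entropy for one example with a single predicted probability p.
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

print(binary_cross_entropy(1, 0.9))  # ~0.105: confident and correct, small loss
print(binary_cross_entropy(0, 0.9))  # ~2.303: confident and wrong, large loss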

However, with 1 output neuron and categorical cross-entropy, the loss is computed as

$$ L_c = -y \log p $$

which is clearly different because it silently drops the $(1-y) \log(1-p)$ term (treats it as zero). This loss is bogus: it is minimized at $L_c = 0$ by predicting $p=1$ for every input, regardless of the true label, resulting in a totally useless model.
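To see the failure concretely, here is a hedged sketch of that degenerate loss: a negative example ($y = 0$) contributes zero loss no matter what the model predicts, so always outputting $p = 1$ looks "optimal" to the optimizer.

import numpy as np

def degenerate_categorical_ce(y, p, eps=1e-12):
    # Categorical cross-entropy collapsed onto a single output neuron:
    # only the -y * log(p) term survives.
    p = np.clip(p, eps, 1 - eps)
    return -y * np.log(p)

print(degenerate_categorical_ce(0, 0.99))  # 0.0: confidently wrong, yet no penalty
print(degenerate_categorical_ce(1, 0.99))  # ~0.01: pushing p toward 1 always reduces the loss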

To use categorical cross-entropy correctly, OP needs to make these changes:

  • use $k$ output neurons (one for each of the $k$ classes). In OP's particular case, $k=2$
  • these output neurons need to be a probability vector: the neurons sum to 1 for all inputs, and all values are non-negative. The standard way to do this is to use a softmax activation in the output layer.

After making these changes, categorical cross-entropy is computed correctly. The reason is that the model now outputs two probabilities $p_1, p_2$, so the loss becomes

$$ L_c = -y \log p_1 - (1 - y) \log p_2 $$

where $ 0 \le p_i \le 1$ and $p_1 + p_2 = 1$. Identifying $p_1 = p$ and $p_2 = 1 - p$, simple algebra shows that $L_c = L_b$, as desired.
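For completeness, here is a minimal sketch of a corrected setup, assuming standalone Keras (the imports and the to_categorical call are mine, not OP's; layer sizes follow OP's snippet):

from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.utils import to_categorical

# One-hot encode the 0/1 labels so y_train has shape (n_samples, 2)
y_train_onehot = to_categorical(y_train, num_classes=2)

model = Sequential()
model.add(LSTM(256, input_shape=(n_timesteps, n_features), recurrent_activation='hard_sigmoid'))
model.add(Dense(16))
model.add(Dense(2, activation='softmax'))  # k = 2 output neurons forming a probability vector
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Alternatively, sparse_categorical_crossentropy accepts the integer 0/1 labels directly and avoids the one-hot step; either way, the loss now matches binary_crossentropy up to the equivalence shown above.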