I have a dataset with 10 categorical input features and one categorical output feature with classes 0 and 1. X_train is a 3D array, so I label-encoded the dataset beforehand. With categorical_crossentropy and a sigmoid activation I get 26% accuracy, but when I switch to binary_crossentropy the accuracy jumps to 98%.
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(256, input_shape=(n_timesteps, n_features),
               recurrent_activation='hard_sigmoid'))
model.add(Dense(16))
model.add(Dense(n_outputs, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The dataset is split as:
X_train: (430000, 5, 10)
y_train: (430000, 1)
Best Answer
There are circumstances when the two losses are equivalent, but those circumstances do not exist in OP's code.
In a comment, OP writes that they only have one output neuron.
With 1 output neuron and binary cross-entropy, the model outputs a single value $p$ and the loss for one example is computed as
$$ L_b = -y \log p - (1 - y) \log (1 - p), $$ which is the correct way to compute the loss.
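This formula is easy to check numerically; a minimal NumPy sketch (`binary_ce` is my own helper name, not a Keras function) shows the expected behavior: confident correct predictions incur a small loss, confident wrong ones a large loss.

```python
import numpy as np

def binary_ce(y, p, eps=1e-12):
    """Binary cross-entropy for a single example with label y and prediction p."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

print(binary_ce(1, 0.9))  # small loss: confident and correct
print(binary_ce(1, 0.1))  # large loss: confident and wrong
```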
However, with 1 output neuron and categorical cross-entropy, the loss is computed as
$$ L_c = -y \log p $$
which is clearly different because it fixes $(1-y) \log(1-p)=0$. This loss is obviously bogus because it is minimized at $L_c = 0$ by setting $p=1$ regardless of the input, resulting in a totally useless model.
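The degeneracy is easy to demonstrate with the same kind of sketch (`categorical_ce_one_neuron` is again my own helper, mimicking what categorical cross-entropy reduces to with a single output): pushing $p$ toward 1 drives the loss toward 0 for every input, and examples with $y=0$ contribute nothing at all.

```python
import numpy as np

def categorical_ce_one_neuron(y, p, eps=1e-12):
    """Categorical cross-entropy collapsed to one output:
    only the -y * log(p) term survives."""
    return -y * np.log(np.clip(p, eps, 1 - eps))

print(categorical_ce_one_neuron(1, 0.999))  # tiny loss for y=1
print(categorical_ce_one_neuron(0, 0.999))  # zero loss for y=0, regardless of p
```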
To use categorical cross-entropy correctly, OP needs to make these changes:

- use 2 output neurons instead of 1,
- replace the sigmoid output activation with softmax, and
- one-hot encode y_train, so its shape becomes (430000, 2).
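A sketch of the corrected setup, assuming the same architecture as in the question: two output neurons, softmax, one-hot labels (the dummy `y_train` here stands in for OP's real labels).

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.utils import to_categorical

n_timesteps, n_features = 5, 10  # matches X_train: (430000, 5, 10)

# Same architecture, but with 2 output neurons and softmax, so that
# categorical cross-entropy sees a full probability distribution.
model = Sequential()
model.add(LSTM(256, input_shape=(n_timesteps, n_features),
               recurrent_activation='hard_sigmoid'))
model.add(Dense(16))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# Labels must be one-hot: (430000, 1) -> (430000, 2); a tiny dummy here.
y_train = np.array([[0], [1], [1], [0]])
y_train_onehot = to_categorical(y_train, num_classes=2)
print(y_train_onehot.shape)  # (4, 2)
```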
After these changes, the loss is computed correctly under categorical cross-entropy, because the model now outputs two values $p_1, p_2$ and the loss is
$$ L_c = -y \log p_1 - (1 - y) \log p_2 $$
where $0 \le p_i \le 1$ and $p_1 + p_2 = 1$. Substituting $p_2 = 1 - p_1$ recovers $L_c = -y \log p_1 - (1 - y) \log (1 - p_1) = L_b$, as desired.
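That equivalence can be verified numerically; in this sketch, `lb` and `lc` are my own shorthand for the two losses above.

```python
import numpy as np

def lb(y, p):
    """Binary cross-entropy with a single output p."""
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def lc(y, p1, p2):
    """Categorical cross-entropy with two outputs (p1, p2)."""
    return -y * np.log(p1) - (1 - y) * np.log(p2)

# With p1 = p and p2 = 1 - p the two losses coincide for any y and p.
for y in (0, 1):
    for p in (0.1, 0.5, 0.9):
        assert np.isclose(lb(y, p), lc(y, p, 1 - p))
```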