Cross Entropy vs. Sparse Cross Entropy: When to use one over the other

conv-neural-network, cross entropy, information theory, loss-functions, machine learning

I am playing with convolutional neural networks using Keras+Tensorflow to classify categorical data. I have a choice of two loss functions: categorical_crossentropy and sparse_categorical_crossentropy.

I have a good intuition about the categorical_crossentropy loss function, which is defined as follows:

$$
J(\textbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \text{log}(\hat{y}_{i,k})
$$

where,

  • $\textbf{w}$ refers to the model parameters, e.g. the weights of the neural network
  • $y_{i,k}$ is 1 if sample $i$ belongs to class $k$ and 0 otherwise (the components of the one-hot encoded true label)
  • $\hat{y}_{i,k}$ is the predicted probability that sample $i$ belongs to class $k$
  • $N$ is the number of samples and $K$ is the number of classes

The true labels are one-hot encoded vectors, and the predicted labels are probability vectors over the same $K$ classes (e.g. softmax outputs).
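For concreteness, here is a minimal NumPy sketch of that formula; the labels and predictions below are made up purely for illustration:

```python
import numpy as np

# 3 samples, 3 classes: y_true is one-hot, y_pred are softmax-style probabilities.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=float)
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.2, 0.6]])

# Categorical cross entropy: average over samples of -sum_k y_{i,k} * log(y_hat_{i,k}).
loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
print(loss)  # ~0.364
```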

Questions:

  • How does the above loss function change in sparse_categorical_crossentropy?
  • What is the mathematical intuition behind it?
  • When to use one over the other?

Best Answer

Both categorical cross entropy and sparse categorical cross entropy use the same loss function, the one you have mentioned above. The only difference is the format in which you provide the true labels $y_i$.
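To answer the first question directly: with an integer class index $c_i \in \{0, \dots, K-1\}$ for sample $i$, only the true-class term of the inner sum survives, so the sparse variant computes exactly the same quantity,

$$
J(\textbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \text{log}(\hat{y}_{i,c_i})
$$

There is no mathematical difference, only a difference in how the labels are represented.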

If your $y_i$'s are one-hot encoded vectors, use categorical_crossentropy. Examples for a 3-class classification: [1,0,0], [0,1,0], [0,0,1]

But if your $y_i$'s are integer class indices, use sparse_categorical_crossentropy. Examples for the same 3-class problem: [0], [1], [2] (Keras expects indices starting at 0).
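As a minimal Keras sketch (the architecture, data, and shapes below are purely illustrative), the only things that change between the two setups are the label array and the loss string:

```python
import numpy as np
import tensorflow as tf

num_classes = 3
x = np.random.rand(6, 8).astype("float32")          # illustrative features
y_int = np.array([0, 1, 2, 1, 0, 2])                # integer class indices
y_onehot = tf.keras.utils.to_categorical(y_int, num_classes)

def build_model(loss):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(8,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
    return model

# One-hot labels -> categorical_crossentropy
build_model("categorical_crossentropy").fit(x, y_onehot, epochs=1, verbose=0)

# Integer labels -> sparse_categorical_crossentropy
build_model("sparse_categorical_crossentropy").fit(x, y_int, epochs=1, verbose=0)
```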

The usage depends entirely on how you load your dataset. One advantage of sparse categorical cross entropy is that it saves memory as well as computation time, because it uses a single integer per label rather than a whole one-hot vector.
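As a quick sanity check that both really compute the same value (the predictions below are made up), you can compare the two built-in Keras loss objects on the same data:

```python
import numpy as np
import tensorflow as tf

y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]], dtype="float32")

y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]], dtype="float32")   # one-hot true labels
y_int = np.array([0, 1])                            # same labels as class indices

cce = tf.keras.losses.CategoricalCrossentropy()
scce = tf.keras.losses.SparseCategoricalCrossentropy()

print(float(cce(y_onehot, y_pred)))   # ~0.290, from one-hot labels
print(float(scce(y_int, y_pred)))     # same value, from a much smaller label array
```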