Softmax – Using Softmax Activation Function for Multi-Class Classification

classification, conv-neural-network, softmax

In the last layer of CNNs, it is common to use a softmax activation function for multi-class classification. I would like to know: is it necessary to use a softmax activation function when creating a CNN for an image classification task, and does this choice have nothing to do with the optimizer used to train the model?

Best Answer

Softmax outputs a probability vector. That means that

  • the elements are nonnegative and
  • the elements sum to 1.
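
As a minimal numeric sketch (plain NumPy, with made-up logits), both properties are easy to verify:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is mathematically unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])   # arbitrary logits
p = softmax(z)
print(p)          # every entry is nonnegative
print(p.sum())    # 1.0 (up to floating-point error)
```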

To train a classification model with $m \ge 3$ classes, the standard approach is to use softmax as the final activation with multinomial cross-entropy loss. For a single instance, the loss is

$$ \mathcal{L} = -\sum_{j=1}^m y_j \log(p_j) $$

where $y$ is a one-hot vector (a single entry equal to 1 and the rest zero) and the $p_j$ are our predicted probabilities from the softmax. If the single value of 1 in $y$ is at index $k$, then the loss achieves its minimum value of 0 when $p_k = 1$. When $p_k=1$, the rest of the $p_{j\neq k}$ are all 0 (because $p$ is a vector of probabilities, so the total is 1).
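
As a small worked example (with hypothetical probabilities), only the term for the true class contributes to the sum:

```python
import numpy as np

def cross_entropy(y, p):
    # y is one-hot, p is a probability vector; only the true-class term survives.
    return -np.sum(y * np.log(p))

y = np.array([0.0, 1.0, 0.0])     # true class is index 1
p = np.array([0.1, 0.7, 0.2])     # softmax output
print(cross_entropy(y, p))        # -log(0.7) ≈ 0.357
print(cross_entropy(y, np.array([1e-9, 1.0 - 2e-9, 1e-9])))  # ≈ 0 as p_k -> 1
```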

In a comment, OP proposes using ReLU instead of softmax. However, there are some problems with this proposal.

  1. You can still encounter $\log(0)$, because ReLU can return zeros. (But this is not fatal, because we can "patch" it; a strictly positive ReLU activation like $\text{ReLU}(x)+\epsilon$ for some small $\epsilon>0$ avoids this.)

  2. For ReLUs, the sum of $p$ can be any nonnegative value, so $p$ is not a probability vector. Because $-\log(p_k)$ decreases without bound as $p_k$ increases, the model will never stop training. (But this isn't fatal; penalizing the weights and biases or otherwise constraining them will prevent them from drifting away to $\pm\infty$.) On the other hand, for softmax, the largest $p_k$ can ever be is 1, so the minimum loss is 0.

  3. ReLU does not force a tradeoff among the units, whereas softmax does. What this means is that if you use softmax and want to increase the value of $p_k$, you have to decrease $\sum_{i\neq k} p_i$, so the loss will be high whenever $p$ and $y$ are different. By contrast, the ReLU model can just return some vector of constants and achieve the same loss no matter what the label is. Consider the three-class case where the correct class is the second one and the model returns the constant vector $p=(c,c,c)$; then $$\mathcal{L}=-0\times \log(c)-1\times\log(c)-0\times\log(c)=-\log(c).$$ Likewise, this same loss of $-\log(c)$ is obtained for the same $p$ and any other label vector $y$.

Clearly, (3) is fatal because the model has no useful information about which class is the most likely. A model that can always reduce the loss by ignoring the input entirely is a bogus model.

The key detail about softmax is that it forces a tradeoff among the values of $p$, because assigning any probability to the incorrect class is penalized. The only softmax model which has 0 multinomial cross-entropy loss is the model that assigns probability of 1 to the correct class for all instances.
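
Here is a quick sketch of point (3), using plain NumPy and a hypothetical constant $c=0.9$: a constant ReLU-style output has exactly the same loss for every label, while a probability vector of the kind softmax produces must trade off mass between classes and therefore discriminates between labels.

```python
import numpy as np

def cross_entropy(y, p):
    return -np.sum(y * np.log(p))

c = 0.9
p_relu = np.array([c, c, c])             # constant "ReLU" output, ignores the input
for true_class in range(3):
    y = np.eye(3)[true_class]
    print(true_class, cross_entropy(y, p_relu))     # -log(0.9) ≈ 0.105 for every label

p_softmax = np.array([0.9, 0.05, 0.05])  # probability vector: mass concentrated on class 0
for true_class in range(3):
    y = np.eye(3)[true_class]
    print(true_class, cross_entropy(y, p_softmax))  # ≈ 0.105, 3.0, 3.0
```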


Softmax isn't the only function you could use. A function like

$$ f(z)_i = \frac{\text{softplus}(z_i)}{\sum_j \text{softplus}(z_j)} $$ where the softplus function is

$$ \text{softplus}(x)=\log(1+\exp(x)) $$ could also work for a multi-class classification model because $f$ is

  • strictly positive (avoids $\log(0)$ and division by zero),
  • normalized so the entries sum to 1 (i.e., a probability vector), and
  • monotonically increasing.
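
As a quick numerical check (a NumPy sketch with made-up logits), this softplus-normalized $f$ and softmax both return valid probability vectors that preserve the ordering of $z$:

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))
    return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))

def softplus_normalized(z):
    s = softplus(z)
    return s / s.sum()

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([3.0, 1.0, -2.0])    # arbitrary logits
print(softplus_normalized(z))     # strictly positive, sums to 1, same ordering as z
print(softmax(z))                 # also a probability vector, with sharper contrasts
```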

We care about monotonicity because we want large $z_i$ to imply large probabilities. A non-monotonic function like squaring or absolute value would assign high probability to a class when its input is either very large or very negative. See: Why is softmax function used to calculate probabilities although we can divide each value by the sum of the vector?
