Solved – How to use negative examples (in addition to positive ones) for training a multiclass softmax classifier (or a neural net with softmax output)

classification, machine learning, multi-class, neural networks, softmax

Suppose we are training a neural network for multi-class classification, and we use softmax (or hierarchical softmax) as its output layer.

For positive examples, we need to maximize the log likelihood of training examples. We can calculate the gradient of the negative log likelihood for each example, and do stochastic gradient descent.

My question is, if we also have negative examples (e.g. some prediction the model made in the past that a human later judged to be incorrect), how do we incorporate them into the training process? It sounds right to me to minimize the likelihood of making these predictions; is it okay to simply negate the gradient? (N.b. initial tests show that this causes the model to diverge.)

Best Answer

It helps to think about the process in probabilistic terms. For classification you're trying to infer $p(y|x)$, where $y$ is a one-hot encoded label vector and $x$ is a sample you're classifying, like an image. Nowadays people mostly use neural networks to estimate this distribution, that is, $p(y|x) = y^T\text{Softmax}(f(x; \Theta))$, where $f(x; \Theta)$ is some neural net mapping images to an arbitrary $K$-dimensional vector of logits ($K$ being the number of classes).
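For concreteness, here is a minimal NumPy sketch of this model; `f` is just a stand-in linear map, and `W`, `K`, and the shapes are illustrative assumptions, not anything from the question:

```python
import numpy as np

def softmax(z):
    # Shift by the max so exp() cannot overflow.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
K = 5                              # number of classes
W = rng.normal(size=(10, K))       # hypothetical parameters Theta
f = lambda x: x @ W                # stand-in "network": a single linear map
x = rng.normal(size=10)            # one input sample
y = np.eye(K)[2]                   # one-hot label for class 2

p_y_given_x = y @ softmax(f(x))    # p(y|x) = y^T Softmax(f(x; Theta))
```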

So far I've only described the model, but haven't said how to actually fit it, that is, how to update the neural net given data $\{(x_n, y_n)\}_{n=1}^N$. The standard way is maximum likelihood estimation, that is, to maximize the probability of the observed data under our model:

$$ \hat \Theta = \text{argmax}_{\Theta} \prod_{n=1}^N y_n^T \text{Softmax}(f(x_n; \Theta)) = \text{argmax}_{\Theta} \sum_{n=1}^N \log \left( y_n^T \text{Softmax}(f(x_n; \Theta)) \right) $$

Since each $y$ is one-hot, we can further simplify this expression:

$$ \hat \Theta = \text{argmax}_{\Theta} \sum_{n=1}^N \left( y_n^T f(x_n; \Theta) - \log \sum_{k=1}^K \exp(f_k(x_n; \Theta)) \right) $$

This is your typical cross-entropy "loss" (technically it's not a loss, since here we're maximizing it rather than minimizing it).
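As a quick sanity check of that simplification, here is a NumPy snippet (using SciPy's `logsumexp`) that evaluates both forms of the per-sample log-likelihood; the logits and label are made up for illustration:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
K = 5
logits = rng.normal(size=K)            # f(x_n; Theta) for one sample
y = np.eye(K)[2]                       # one-hot label

# Direct form: log( y^T Softmax(logits) )
probs = np.exp(logits - logsumexp(logits))
direct = np.log(y @ probs)

# Simplified form: y^T logits - log sum_k exp(logits_k)
simplified = y @ logits - logsumexp(logits)

assert np.isclose(direct, simplified)  # both equal the per-sample log-likelihood
```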

So if you know the right label for a sample $x$, this objective already punishes the model for not predicting it.

If, however, you don't know the right label, but do know that it's not label $l$, you can instead maximize the probability $p(y \neq l \mid x)$ for that sample. The corresponding term in the objective is

$$ \log p(y \neq l \mid x) = \log(1 - p(y = l \mid x)) = \log \sum_{\substack{k=1 \\ k \neq l}}^K \exp(f_k(x; \Theta)) - \log \sum_{k=1}^K \exp(f_k(x; \Theta)) $$

(Note: this expression mixes logarithms, sums, and exponentials, so a naive implementation might suffer from numerical over/underflow; compute both terms with the log-sum-exp trick.)
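As a sketch of one numerically stable implementation, here is the term written with log-sum-exp throughout, in PyTorch; the function name `neg_label_log_prob` is mine, not a library API, and maximizing this term means minimizing its negation:

```python
import torch

def neg_label_log_prob(logits, l):
    """log p(y != l | x), computed stably via log-sum-exp.

    logits: tensor of shape (K,), the raw outputs f(x; Theta)
    l:      the class index known to be wrong
    """
    # Mask out class l, then log-sum-exp over the remaining classes.
    keep = torch.ones_like(logits, dtype=torch.bool)
    keep[l] = False
    return torch.logsumexp(logits[keep], dim=0) - torch.logsumexp(logits, dim=0)

# Hypothetical usage: treat the negated term as the loss for a negative example
# and mix such losses with the usual cross-entropy terms in SGD.
logits = torch.randn(5, requires_grad=True)
loss = -neg_label_log_prob(logits, l=3)
loss.backward()   # gradients flow through both log-sum-exp terms
```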