Solved – Classification with noisy labels

loss-functions, machine-learning, neural-networks, noise

I'm trying to train a neural network for classification, but the labels I have are rather noisy (around 30% of the labels are wrong).

Cross-entropy loss does work, but I was wondering: are there alternatives that are more effective in this case, or is cross-entropy loss already optimal?

I'm not sure, but I'm thinking of somehow "clipping" the cross-entropy loss, so that the loss for any single data point is no greater than some upper bound. Would that work?
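For concreteness, here is a minimal sketch of the clipping idea, assuming PyTorch; `clipped_cross_entropy` and the bound `max_loss` are just illustrative names, not an established method:

```python
import torch
import torch.nn.functional as F

def clipped_cross_entropy(logits, targets, max_loss=2.0):
    # Per-example cross-entropy, then capped at max_loss so that a single
    # (possibly mislabeled) example cannot contribute an unbounded loss.
    per_example = F.cross_entropy(logits, targets, reduction="none")
    return per_example.clamp(max=max_loss).mean()
```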

Thanks!

Update
Following Lucas' answer, I worked out the derivatives with respect to the prediction output $y$ and the input $z$ of the softmax function. So I guess essentially it adds a smoothing term $\frac{3}{7N}$ inside the derivatives.
$$p_i=0.3/N+0.7y_i$$
$$l=-\sum t_i\log(p_i)$$
$$\frac{\partial l}{\partial y_i}=-t_i\frac{\partial\log(p_i)}{\partial p_i}\frac{\partial p_i}{\partial y_i}=-0.7\frac{t_i}{p_i}=-\frac{t_i}{\frac{3}{7N}+y_i}$$
$$\frac{\partial l}{\partial z_i}=-0.7\sum_j\frac{t_j}{p_j}\frac{\partial y_j}{\partial z_i}=y_i\sum_jt_j\frac{y_j}{\frac{3}{7N}+y_j}-t_i\frac{y_i}{\frac{3}{7N}+y_i}$$
Derivatives for the original cross-entropy loss:
$$\frac{\partial l}{\partial y_i}=-\frac{t_i}{y_i}$$
$$\frac{\partial l}{\partial z_i}=y_i-t_i$$
Please let me know if I'm wrong. Thanks!
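As a sanity check on the derivation above, here is a small numerical comparison against PyTorch autograd (just a sketch; $N=5$ and the target index are arbitrary choices):

```python
import torch

torch.manual_seed(0)
N = 5
z = torch.randn(N, requires_grad=True)   # softmax inputs (logits) z_i
t = torch.zeros(N); t[2] = 1.0           # one-hot target t_i

y = torch.softmax(z, dim=0)              # predictions y_i
p = 0.3 / N + 0.7 * y                    # smoothed predictions p_i
loss = -(t * torch.log(p)).sum()
loss.backward()

# Closed-form gradient from the update above, with c = 3/(7N)
c = 3.0 / (7.0 * N)
y_d = y.detach()
manual = y_d * (t * y_d / (c + y_d)).sum() - t * y_d / (c + y_d)
print(torch.allclose(z.grad, manual, atol=1e-6))  # True if the derivation is right
```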

Update
I just happened to read a paper by Google (Szegedy et al., "Rethinking the Inception Architecture for Computer Vision") that applies the same formula as in Lucas' answer, but with a different interpretation.

In Section 7, "Model Regularization via Label Smoothing", they write:

This (the cross entropy loss), however, can cause two problems. First, it may result in
over-fitting: if the model learns to assign full probability to the
groundtruth label for each training example, it is not guaranteed to
generalize. Second, it encourages the differences between the largest
logit and all others to become large, and this, combined with the
bounded gradient $\partial l/\partial z_k$, reduces the ability of the model to adapt.
Intuitively, this happens because the model becomes too confident
about its predictions.

But instead of adding the smoothing term to the predictions, they add it to the ground-truth label distribution, which turned out to be helpful:

$$q'(k|x) = (1-\epsilon)\,\delta_{k,y} + \epsilon\, u(k)$$

In our ImageNet experiments with $K = 1000$ classes, we used $u(k) = 1/1000$ and $\epsilon = 0.1$. For ILSVRC 2012, we have found a consistent improvement of about 0.2% absolute both for top-1 error and the top-5 error.

Best Answer

The right thing to do here is to change the model, not the loss. Your goal is still to correctly classify as many data points as possible (which determines the loss), but your assumptions about the data have changed (which are encoded in a statistical model, the neural network in this case).

Let $\mathbf{p}_t$ be a vector of class probabilities produced by the neural network and $\ell(y_t, \mathbf{p}_t)$ be the cross-entropy loss for label $y_t$. To explicitly take into account the assumption that 30% of the labels are noise (assumed to be uniformly random), we could change our model to produce

$$\mathbf{\tilde p}_t = 0.3/N + 0.7 \mathbf{p}_t$$

instead and optimize

$$\sum_t \ell(y_t, 0.3/N + 0.7 \mathbf{p}_t),$$

where $N$ is the number of classes. This will actually behave somewhat according to your intuition: since every entry of $\mathbf{\tilde p}_t$ is at least $0.3/N$, the loss for any single data point is bounded above by $\log(N/0.3)$, so the loss stays finite.
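A minimal sketch of this noise-aware loss, assuming PyTorch and logits of shape (batch, $N$); `noisy_label_cross_entropy` and the `noise` argument are illustrative names:

```python
import torch
import torch.nn.functional as F

def noisy_label_cross_entropy(logits, targets, noise=0.3):
    # Model assumption: with probability `noise` the observed label is uniformly
    # random, so the predicted class distribution becomes
    # p~ = noise/N + (1 - noise) * softmax(logits), and we take the
    # cross-entropy of the observed labels against p~.
    N = logits.size(-1)
    p = torch.softmax(logits, dim=-1)
    p_tilde = noise / N + (1.0 - noise) * p
    return F.nll_loss(torch.log(p_tilde), targets)
```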
