Solved – Classification with noisy labels

loss-functions, machine-learning, neural-networks, noise

I'm trying to train a neural network for classification, but the labels I have are rather noisy (around 30% of the labels are wrong).

Cross-entropy loss does work, but I was wondering: are there alternatives that are more effective in this case, or is cross-entropy loss already optimal?

I'm not sure, but I'm thinking of somehow "clipping" the cross-entropy loss, so that the loss for any single data point is no greater than some upper bound. Would that work?
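For concreteness, here is a minimal sketch of the clipping idea, assuming PyTorch; `clipped_cross_entropy` and the bound `max_loss` are just illustrative names, not an established method:

```python
import torch
import torch.nn.functional as F

def clipped_cross_entropy(logits, targets, max_loss=2.0):
    # Per-example cross-entropy, then capped at max_loss so that a single
    # (possibly mislabeled) example cannot contribute an unbounded loss.
    per_example = F.cross_entropy(logits, targets, reduction="none")
    return per_example.clamp(max=max_loss).mean()
```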

Thanks!

Update
Following Lucas' answer, I worked out the derivatives with respect to the prediction output $y$ and the input $z$ of the softmax function. So I guess essentially it adds a smoothing term $\frac{3}{7N}$ inside the derivatives.
$$p_i=0.3/N+0.7y_i$$
$$l=-\sum t_i\log(p_i)$$
$$\frac{\partial l}{\partial y_i}=-t_i\frac{\partial\log(p_i)}{\partial p_i}\frac{\partial p_i}{\partial y_i}=-0.7\frac{t_i}{p_i}=-\frac{t_i}{\frac{3}{7N}+y_i}$$
$$\frac{\partial l}{\partial z_i}=-0.7\sum_j\frac{t_j}{p_j}\frac{\partial y_j}{\partial z_i}=y_i\sum_jt_j\frac{y_j}{\frac{3}{7N}+y_j}-t_i\frac{y_i}{\frac{3}{7N}+y_i}$$
Derivatives for the original cross-entropy loss:
$$\frac{\partial l}{\partial y_i}=-\frac{t_i}{y_i}$$
$$\frac{\partial l}{\partial z_i}=y_i-t_i$$
Please let me know if I'm wrong. Thanks!
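As a sanity check on the derivation above, here is a small numerical comparison against PyTorch autograd (just a sketch; $N=5$ and the target index are arbitrary choices):

```python
import torch

torch.manual_seed(0)
N = 5
z = torch.randn(N, requires_grad=True)   # softmax inputs (logits) z_i
t = torch.zeros(N); t[2] = 1.0           # one-hot target t_i

y = torch.softmax(z, dim=0)              # predictions y_i
p = 0.3 / N + 0.7 * y                    # smoothed predictions p_i
loss = -(t * torch.log(p)).sum()
loss.backward()

# Closed-form gradient from the update above, with c = 3/(7N)
c = 3.0 / (7.0 * N)
y_d = y.detach()
manual = y_d * (t * y_d / (c + y_d)).sum() - t * y_d / (c + y_d)
print(torch.allclose(z.grad, manual, atol=1e-6))  # True if the derivation is right
```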

Update
I just happened to read a paper by Google (Szegedy et al., "Rethinking the Inception Architecture for Computer Vision") that applies the same formula as in Lucas' answer, but with a different interpretation.

In Section 7, "Model Regularization via Label Smoothing", they write:

This (the cross entropy loss), however, can cause two problems. First, it may result in
over-fitting: if the model learns to assign full probability to the
groundtruth label for each training example, it is not guaranteed to
generalize. Second, it encourages the differences between the largest
logit and all others to become large, and this, combined with the
bounded gradient $\partial l/\partial z_k$, reduces the ability of the model to adapt.
Intuitively, this happens because the model becomes too confident
about its predictions.

But instead of adding the smoothing term to the predictions, they add it to the ground-truth label distribution, which turned out to be helpful:

$$q'(k|x) = (1-\epsilon)\,\delta_{k,y} + \epsilon\, u(k)$$

In our ImageNet experiments with $K = 1000$ classes, we used $u(k) = 1/1000$ and $\epsilon = 0.1$. For ILSVRC 2012, we have found a consistent improvement of about 0.2% absolute both for top-1 error and the top-5 error.

Best Answer

The right thing to do here is to change the model, not the loss. Your goal is still to correctly classify as many data points as possible (which determines the loss), but your assumptions about the data have changed (which are encoded in a statistical model, the neural network in this case).

Let $\mathbf{p}_t$ be a vector of class probabilities produced by the neural network and $\ell(y_t, \mathbf{p}_t)$ be the cross-entropy loss for label $y_t$. To explicitly take into account the assumption that 30% of the labels are noise (assumed to be uniformly random), we could change our model to produce

$$\mathbf{\tilde p}_t = 0.3/N + 0.7 \mathbf{p}_t$$

instead and optimize

$$\sum_t \ell(y_t, 0.3/N + 0.7 \mathbf{p}_t),$$

where $N$ is the number of classes. This will actually behave somewhat according to your intuition: since every entry of $\mathbf{\tilde p}_t$ is at least $0.3/N$, the loss for any single data point is bounded above by $\log(N/0.3)$, so the loss stays finite.
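A minimal sketch of this noise-aware loss, assuming PyTorch and logits of shape (batch, $N$); `noisy_label_cross_entropy` and the `noise` argument are illustrative names:

```python
import torch
import torch.nn.functional as F

def noisy_label_cross_entropy(logits, targets, noise=0.3):
    # Model assumption: with probability `noise` the observed label is uniformly
    # random, so the predicted class distribution becomes
    # p~ = noise/N + (1 - noise) * softmax(logits), and we take the
    # cross-entropy of the observed labels against p~.
    N = logits.size(-1)
    p = torch.softmax(logits, dim=-1)
    p_tilde = noise / N + (1.0 - noise) * p
    return F.nll_loss(torch.log(p_tilde), targets)
```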
