Solved – Is it okay to use cross entropy loss function with soft labels

classification, loss-functions

I have a classification problem where pixels will be labeled with soft labels (which denote probabilities) rather than hard 0/1 labels. Previously, with hard 0/1 pixel labeling, the cross entropy loss function (SigmoidCrossEntropyLossLayer from Caffe) was giving decent results. Is it okay to use the sigmoid cross entropy loss layer (from Caffe) for this soft classification problem?

Best Answer

The answer is yes, but you have to define it the right way.

Cross entropy is defined on probability distributions, not on single values. For discrete distributions $p$ and $q$, it's: $$H(p, q) = -\sum_y p(y) \log q(y)$$

When the cross entropy loss is used with 'hard' class labels, what this really amounts to is treating $p$ as the conditional empirical distribution over class labels. This is a distribution where the probability is 1 for the observed class label and 0 for all others. $q$ is the conditional distribution (probability of class label, given input) learned by the classifier. For a single observed data point with input $x_0$ and class $y_0$, we can see that the expression above reduces to the standard log loss (which would be averaged over all data points):

$$-\sum_y I\{y = y_0\} \log q(y \mid x_0) = -\log q(y_0 \mid x_0)$$

Here, $I\{\cdot\}$ is the indicator function, which is 1 when its argument is true and 0 otherwise (this is how the empirical distribution enters the expression). The sum is taken over the set of possible class labels.
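To make this concrete, here is a minimal NumPy sketch (the class probabilities are made up for illustration) showing that cross entropy against a one-hot empirical distribution is exactly the log loss of the observed class:

```python
import numpy as np

# Hypothetical example: three possible classes, observed class y0 = 2.
q = np.array([0.1, 0.2, 0.7])   # classifier's distribution q(y | x0)
p = np.array([0.0, 0.0, 1.0])   # empirical distribution: 1 at y0, 0 elsewhere

cross_entropy = -np.sum(p * np.log(q))   # H(p, q) from the definition above
log_loss = -np.log(q[2])                 # standard log loss, -log q(y0 | x0)

print(cross_entropy, log_loss)           # both are ~0.3567; the expressions agree
```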

In the case of 'soft' labels like you mention, the labels are no longer class identities themselves, but probabilities over two possible classes. Because of this, you can't use the standard expression for the log loss. But, the concept of cross entropy still applies. In fact, it seems even more natural in this case.

Let's call the class $y$, which can be 0 or 1. And, let's say that the soft label $s(x)$ gives the probability that the class is 1 (given the corresponding input $x$). So, the soft label defines a probability distribution:

$$p(y \mid x) = \left \{ \begin{array}{cl} s(x) & \text{If } y = 1 \\ 1-s(x) & \text{If } y = 0 \end{array} \right .$$

The classifier also gives a distribution over classes, given the input:

$$ q(y \mid x) = \left \{ \begin{array}{cl} c(x) & \text{If } y = 1 \\ 1-c(x) & \text{If } y = 0 \end{array} \right . $$

Here, $c(x)$ is the classifier's estimated probability that the class is 1, given input $x$.

The task is now to determine how different these two distributions are, using the cross entropy. Plug these expressions for $p$ and $q$ into the definition of cross entropy, above. The sum is taken over the set of possible classes $\{0, 1\}$:

$$ \begin{array}{ccl} H(p, q) & = & - p(y=0 \mid x) \log q(y=0 \mid x) - p(y=1 \mid x) \log q(y=1 \mid x)\\ & = & -(1-s(x)) \log (1-c(x)) - s(x) \log c(x) \end{array} $$
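As a sketch (the function name and example values are my own, not from Caffe), the per-point expression above translates directly into code:

```python
import numpy as np

def soft_binary_cross_entropy(s, c, eps=1e-12):
    """Cross entropy between the soft label s(x) and the classifier output c(x).

    s : soft label, probability that the class is 1 given input x
    c : classifier's estimated probability that the class is 1 given input x
    """
    c = np.clip(c, eps, 1.0 - eps)   # avoid log(0)
    return -(1.0 - s) * np.log(1.0 - c) - s * np.log(c)

# Hypothetical values for one pixel: the soft label says "70% class 1",
# while the classifier currently predicts 60%.
print(soft_binary_cross_entropy(0.7, 0.6))
```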

That's the expression for a single, observed data point. The loss function would be the mean over all data points. Of course, this can be generalized to multiclass classification as well.
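For completeness, here is a hedged sketch of that generalization: soft label distributions and classifier outputs for a batch of points, with the cross entropy averaged over the batch (the array shapes and names are assumptions for illustration):

```python
import numpy as np

def soft_cross_entropy(P, Q, eps=1e-12):
    """Mean cross entropy between soft labels P and classifier outputs Q.

    P, Q : arrays of shape (n_points, n_classes); each row sums to 1.
    """
    Q = np.clip(Q, eps, 1.0)                       # avoid log(0)
    return -np.mean(np.sum(P * np.log(Q), axis=1))  # H(p, q) per point, then mean

# Hypothetical batch of 2 points with 3 classes each.
P = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.5, 0.5]])
Q = np.array([[0.6, 0.3, 0.1],
              [0.1, 0.4, 0.5]])
print(soft_cross_entropy(P, Q))
```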