Solved – Cross entropy-equivalent loss suitable for real-valued labels

cross entropy, loss-functions, machine learning, neural networks, optimization

I am building a model whose outputs lie between 0 and 1, and the goal is to minimize a cost function over the predicted values and the labels. So far everything seems easy, but my labels are real-valued, so I cannot use the ordinary cross entropy loss function. For instance, suppose the predicted value is 0.2 and the label is also 0.2. Simply applying cross entropy will not give an output of 0 (the desired output) and still generates gradients unnecessarily. Another problem is that when the prediction is greater than the corresponding label, ordinary cross entropy does not produce sensible outputs. So I wonder: is there an equivalent version of the cross entropy function that deals with continuous [0, 1] labels?
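A minimal sketch of what I mean (plain NumPy; the numbers are chosen only for illustration):

```python
import numpy as np

def binary_cross_entropy(label, pred):
    """Ordinary (binary) cross entropy evaluated with a soft label in [0, 1]."""
    return -(label * np.log(pred) + (1 - label) * np.log(1 - pred))

# Prediction equals the label, yet the loss is not zero:
print(binary_cross_entropy(0.2, 0.2))  # ~0.5004, the entropy of a Bernoulli(0.2)
```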

I should note that I have tried loss functions such as L1 and L2 (squared) loss, and so far the ordinary cross entropy loss gives me the best results. That is why I think a cross entropy loss that is suitable for continuous labels would work even better.

Best Answer

Cross entropy is defined on probability distributions, not single values. The reason it works for classification is that classifier output is (often) a probability distribution over class labels. For example, the outputs of logistic/softmax functions are interpreted as probabilities. The observed class label is also treated as a probability distribution: the empirical distribution (where the probability is 1 for the observed class and 0 for the others).
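As a concrete illustration (three classes chosen arbitrarily): if the observed class is class 2, the empirical distribution is $p = (0, 1, 0)$, so the cross entropy against the classifier's output distribution $q$ is $$H(p, q) = -\sum_{k} p_k \log q_k = -\log q_2.$$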

The concept of cross entropy applies equally well to continuous distributions. But it can't be used for regression models that output only a point estimate (e.g., the conditional mean) rather than a full probability distribution. If you had a model that gave the full conditional distribution (the probability of the output given the input), you could use cross entropy as a loss function.
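As a sketch of what that could look like for labels confined to [0, 1] (an illustration, not something prescribed by the question: the model name, architecture, and the choice of a Beta output distribution are all assumptions), the network could output the two parameters of a Beta distribution and be trained by minimizing the negative log likelihood, i.e. the cross entropy against the empirical distribution described below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaRegressor(nn.Module):
    """Hypothetical model: outputs a full conditional Beta distribution over [0, 1]."""
    def __init__(self, n_features, n_hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.head = nn.Linear(n_hidden, 2)  # two Beta concentration parameters per example

    def forward(self, x):
        # softplus keeps both concentration parameters strictly positive
        alpha, beta = F.softplus(self.head(self.body(x))).unbind(dim=-1)
        return torch.distributions.Beta(alpha, beta)

def nll_loss(dist, y):
    """Negative log likelihood = cross entropy against the empirical (delta) distribution."""
    return -dist.log_prob(y).mean()

# Usage sketch with made-up shapes
model = BetaRegressor(n_features=10)
x = torch.randn(8, 10)
y = torch.rand(8).clamp(1e-3, 1 - 1e-3)  # real-valued labels strictly inside (0, 1)
loss = nll_loss(model(x), y)
loss.backward()
```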

For continuous distributions $p$ and $q$, the cross entropy is defined as: $$H(p, q) = -\int_{Y} p(y) \log q(y) dy$$

Just considering a single observed input/output pair $(x, y)$, $p$ would be the empirical conditional distribution (a delta function over the observed output value), and $q$ would be the modeled conditional distribution (probability of output given input). In this case, the cross entropy reduces to $-\log q(y \mid x)$. Summing over data points, this is just the negative log likelihood!
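Writing the empirical distribution explicitly as a delta function, $p(y') = \delta(y' - y)$, the reduction is: $$H(p, q) = -\int_{Y} \delta(y' - y) \log q(y' \mid x) \, dy' = -\log q(y \mid x),$$ and summing over a data set $\{(x_i, y_i)\}_{i=1}^{n}$ gives $\sum_{i=1}^{n} -\log q(y_i \mid x_i)$, the negative log likelihood.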
