Solved – Weighting true positives in a loss function

classificationcross entropyloss-functionsregression

Let's start with the binary cross entropy as defined in TensorFlow:

$$\mathcal{L}(\mathbf y, \mathbf t)=-\frac {1}{N}\sum_n \left[\, t_n\log y_n + (1-t_n)\log (1-y_n)\right],$$
where $t_n \in \{0,1\}$ and $y_n \in [0,1]$.

In order to weight false negatives more heavily, we can introduce a weight $c_n$ as follows:

$$\mathcal{L}(\mathbf y, \mathbf c, \mathbf t)=-\frac {1}{N}\sum_n \left[\, c_n t_n\log y_n + (1-t_n)\log (1-y_n)\right].$$

The first term $t_n \log y_n$ accounts for the false negatives, while the second term $(1-t_n)\log (1-y_n)$ penalizes the false positives.
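In NumPy terms, the weighted loss above can be sketched as follows (a minimal illustration; `weighted_bce` is an illustrative name, not a TensorFlow function):

```python
import numpy as np

def weighted_bce(y, t, c):
    """Weighted binary cross entropy: c_n scales only the t_n = 1 term.

    y: predicted probabilities in (0, 1); t: labels in {0, 1};
    c: per-observation weights on the positive-class term.
    """
    y, t, c = np.asarray(y, float), np.asarray(t, float), np.asarray(c, float)
    return -np.mean(c * t * np.log(y) + (1 - t) * np.log(1 - y))
```

With $c_n = 1$ everywhere this reduces to the plain binary cross entropy; raising $c_n$ makes low probabilities on $t_n = 1$ observations more expensive.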

My question: Is there a way to include true positives in this loss function somehow? I would like to add an extra reward $r_n$ for true positives, as they should be rewarded more highly than true negatives. How could I do this in the context of the binary cross entropy as formulated above?

Some more context: I want to maximize a payoff achieved through a binary action (act/don't act). The payoff is 10 for true positives and -1 for false positives. For false negatives and true negatives the payoff is 0, but the loss function should treat false negatives as carrying an opportunity cost of 10, while true negatives have no opportunity cost.
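One common cost-sensitive reading of this payoff matrix (an assumption, not the only possible encoding) is to weight each class by the payoff lost when an observation of that class is misclassified, and to derive the decision threshold from the expected payoff of acting:

```python
import numpy as np

# Payoffs from the question: TP = 10, FP = -1, TN = 0,
# FN = 0 realized but 10 in opportunity cost.
cost_fn = 10.0  # payoff forgone by missing a positive
cost_fp = 1.0   # payoff lost by acting on a negative

def payoff_weights(t):
    """Per-observation weights c_n: 10 where t_n = 1, 1 otherwise."""
    t = np.asarray(t)
    return np.where(t == 1, cost_fn, cost_fp)

# Alternatively, fit unweighted probabilities and act only where the
# expected payoff is positive: 10 * p - 1 * (1 - p) > 0, i.e. p > 1/11.
threshold = cost_fp / (cost_fn + cost_fp)
```

The thresholding route keeps the probability model unchanged and moves the payoff structure into the decision rule, which is often cleaner than reweighting the loss.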

Best Answer

I think you're making the common mistake of treating logistic regression as a classifier. We don't have false negatives or positives, because those require an assignment of labels. We are instead modeling the probability of a success, so all we have are the modeled probabilities for the observations where the event happened and where it didn't. Depending on our threshold, a probability of $\hat y_i = 0.7$ may lead to either a positive or a negative label.

In this light, $c_i$ doesn't actually represent a penalty for false negatives.

Our loss (ignoring multiplicative constants) can be rewritten as $$ \sum \limits_{i \, :\, t_i=1} \log \hat y_i + \sum \limits_{i \, :\, t_i=0} \log (1 - \hat y_i). $$ This means for each observation where the event happened (i.e. $t_i=1$) we get a contribution of $\log \hat y_i$, and analogously we get $\log (1-\hat y_i)$ for each observation where the event did not happen (i.e. $t_i=0$).

If we add a $c_i$ term as you did, then we get $$ \sum \limits_{i \, :\, t_i=1} c_i \log \hat y_i + \sum \limits_{i \, :\, t_i=0} \log (1 - \hat y_i). $$ The effect is not that we're forcing the model to minimize false negatives; rather, we are changing the contribution to the loss of the observations where the event happened. If we're maximizing, then a large $c_i$ encourages the model to assign larger probabilities to the observations with $t_i=1$, even if the probabilities for $t_i=0$ observations suffer. (No finite $c_i$ will ever allow $\hat y_i = 1$ when $t_i=0$, and any fixed $c_i$ can be overpowered by a sufficiently poorly aligned $t_i$ and $\hat y_i$.) That does mean that for an a priori fixed threshold we'll likely see the false negative rate go down, but not because we're directly penalizing it: we're just encouraging our probabilities to be bigger. Similarly, this same modification will result in a relative increase in the true positive rate, for the exact same reason.
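The inflation effect is easy to see in a one-parameter toy model. For a group of observations with a fraction $p$ of positives and a single fitted probability $y$, the weighted loss is $-(c\,p \log y + (1-p)\log(1-y))$, whose minimizer works out to $y^* = cp/(cp + 1 - p)$: the weight pushes the fitted probability up, but never to 1 for finite $c$. A small sketch checking this against a brute-force grid search (illustrative names, not from any library):

```python
import numpy as np

def optimal_prob(p, c):
    """Closed-form minimizer of -(c*p*log(y) + (1-p)*log(1-y))."""
    return c * p / (c * p + 1 - p)

def grid_optimum(p, c, n=100_000):
    """Brute-force check: minimize the weighted loss over a fine grid of y."""
    y = np.linspace(1e-6, 1 - 1e-6, n)
    loss = -(c * p * np.log(y) + (1 - p) * np.log(1 - y))
    return y[np.argmin(loss)]
```

With $c=1$ the fitted probability equals the empirical fraction $p$; with $c=5$ and $p=0.3$ it rises to about $0.68$, illustrating how the weight shifts probabilities upward rather than penalizing false negatives directly.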