Solved – cross-entropy-like loss function for multiple classes where misclassification costs are not identical

classification, cross-entropy, log-loss, loss-functions, multi-class

For this conversation I'll use the definition of cross-entropy below, where there are $N$ samples and $M$ classes, $y_{ij}$ is 1 if sample $i$ is of class $j$ and 0 otherwise, and $p_{ij}$ is the probability that sample $i$ is of class $j$ as assigned by some model.

$$
\text{Cross Entropy Loss} = -\frac{1}{N}\sum_{i=1}^N \sum_{j=1}^M y_{ij} \log(p_{ij})
$$
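For concreteness, here is a minimal NumPy sketch of this definition (the array names are just illustrative):

```python
import numpy as np

def cross_entropy(y, p):
    """Mean cross-entropy: y is an (N, M) one-hot array of true classes,
    p an (N, M) array of predicted probabilities with rows summing to 1."""
    # Only the probability assigned to each sample's true class contributes.
    return -np.mean(np.sum(y * np.log(p), axis=1))
```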

Cross-entropy assumes that all misclassifications are equally costly. So, for a sample of class 1, if an algorithm assigns $p_1 = 0.2$ to class 1 then the loss is the same no matter how the remaining probability mass is distributed across the other classes. This can be seen from the definition above: the loss depends only on the probability assigned to the true class and not on how the rest of the probability is distributed.
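To see this numerically with the sketch above: two predictions that give the true class the same probability but spread the remainder differently incur identical loss.

```python
y = np.array([[1.0, 0.0, 0.0]])            # one sample, true class 1
p_a = np.array([[0.2, 0.7, 0.1]])          # remaining mass mostly on class 2
p_b = np.array([[0.2, 0.1, 0.7]])          # remaining mass mostly on class 3

print(cross_entropy(y, p_a))               # -log(0.2) ≈ 1.609
print(cross_entropy(y, p_b))               # identical: -log(0.2) ≈ 1.609
```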

Now imagine we're in a situation where different misclassifications have different costs, e.g. for a sample of class 1 it costs us twice as much to incorrectly classify it as class 2 as it does to classify it as class 3. Is there a variant of cross-entropy that can take these cost differences into account?

The best I can come up with is something like:

$$
\text{Loss} = -\frac{1}{N}\sum_{i=1}^N \sum_{j=1}^M \sum_{k\neq j} y_{ij}\, C_{jk} \log(1-p_{ik})
$$

where $C$ is an $M \times M$ matrix whose element $C_{jk}$ is the relative cost of classifying a sample with true class $j$ as class $k$.
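Here is a sketch of this proposed loss in NumPy (hypothetical names; the diagonal of `C` is ignored), together with a cost matrix matching the example above, with the remaining rows filled in arbitrarily:

```python
def cost_weighted_loss(y, p, C):
    """-1/N * sum_i sum_j sum_{k != j} y_ij * C_jk * log(1 - p_ik).

    Assumes all predicted probabilities are strictly between 0 and 1.
    """
    M = C.shape[0]
    C_offdiag = C * (1.0 - np.eye(M))                 # drop the k == j terms
    # (y @ C_offdiag)[i, k] picks out C[true class of i, k], zero for k == true class.
    return -np.mean(np.sum((y @ C_offdiag) * np.log(1.0 - p), axis=1))

# True class 1: predicting class 2 costs twice as much as predicting class 3.
C = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
```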

This loss function has some of the right properties, in that I can specify relative costs for different types of misclassification, but if all the costs are equal it does not reduce to cross-entropy and it has some slightly strange behaviour: for a sample of class 1, if the algorithm predicts $p_1$ for class 1 then the loss is minimised when $p_2 = p_3$, rather than being agnostic about the values of $p_2$ and $p_3$ as cross-entropy is.
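That behaviour can be checked numerically with the sketch above, fixing $p_1 = 0.2$ and making all costs equal:

```python
y = np.array([[1.0, 0.0, 0.0]])
C_equal = np.ones((3, 3)) - np.eye(3)      # all misclassification costs equal

for p2 in [0.1, 0.3, 0.4, 0.7]:
    p = np.array([[0.2, p2, 0.8 - p2]])
    print(p2, cost_weighted_loss(y, p, C_equal))
# Smallest loss at p2 = 0.4, i.e. p2 == p3; cross-entropy would be identical for all four.
```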

Best Answer

Step back and frame the problem more generally.

Let $P$ be the probability matrix, where $P_{ij}$ is the probability of assigning an item of truth class $i$ to class $j$. All rows of $P$ sum to 1.

Create a matrix-valued cost function $C(P)$, where $C_{ij}$ is the cost incurred due to having probability $P_{ij}$ of assigning an item of truth class $i$ to class $j$. Cost elements on the diagonal are 0 when the corresponding entry of $P$ equals 1, and off-diagonal costs are 0 when the corresponding entry of $P$ equals 0. That is, $C_{ii}(P) = 0$ if $P_{ii} = 1$, and for $i \ne j$, $C_{ij}(P) = 0$ if $P_{ij} = 0$.

The objective is to minimize $\sum_i \sum_j C_{ij}(P)$. The sum of off-diagonal costs need not be equal across rows if the truth classes are of unequal importance, so this framework allows weighting both by truth class and by the class into which an item is misclassified.
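A minimal sketch of this framework, assuming each entry cost depends only on the corresponding entry of $P$ (names are illustrative):

```python
import numpy as np

def total_cost(P, cost_fns):
    """Objective sum_i sum_j C_ij(P), where cost_fns[i][j] maps P[i, j] to a cost.

    Diagonal cost functions should vanish at 1, off-diagonal ones at 0.
    """
    M = P.shape[0]
    return sum(cost_fns[i][j](P[i, j]) for i in range(M) for j in range(M))
```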

Your proposed loss function's lack of agnosticism is due to the nonlinearity of your cost function; in this case the convexity of $-\log$ drives the off-diagonal probabilities toward equality.

If you use linear cost elements, you will achieve agnosticism. For instance, the diagonal elements $C_{ii}$ of $C(P)$ could be $c_{ii}(1-P_{ii})$, and the off-diagonal elements $C_{ij}$ could be $c_{ij}P_{ij}$. If you adopt this linear structure, all you have to do is specify the $c_{ij}$'s for all $i$ and $j$.
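For instance, that linear structure could look like this (the $c_{ij}$'s are collected in an $M \times M$ array `c`; names are illustrative):

```python
def linear_cost(P, c):
    """Total cost with linear elements: c_ii * (1 - P_ii) on the diagonal,
    c_ij * P_ij off the diagonal."""
    M = P.shape[0]
    diag = np.diag(c) * (1.0 - np.diag(P))       # c_ii * (1 - P_ii)
    offdiag = c * P * (1.0 - np.eye(M))          # c_ij * P_ij, diagonal zeroed out
    return diag.sum() + offdiag.sum()
```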

You could still achieve agnosticism by using this linear cost structure for the off-diagonal elements while having nonlinear costs for the diagonal elements of $C(P)$. If you use the linear structure for the off-diagonal elements, with all off-diagonal $c_{ij}$ for a given $i$ equal to a common (i.e., unweighted by misclassification class) value $c_i$, and choose $C_{ii}(P) = -\log(P_{ii}) - c_i(1-P_{ii})$, then the off-diagonal terms contribute $c_i \sum_{j \ne i} P_{ij} = c_i(1-P_{ii})$ and each row total is just $-\log(P_{ii})$, so this reduces to the standard cross-entropy loss (I'm not worrying about the factor $1/N$).
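A quick numerical check of that reduction for one row of $P$, building on the sketches above (assumed names, not part of the answer itself):

```python
def row_total(P_row, i, c_i):
    """Row-i total: nonlinear diagonal term plus linear off-diagonal terms with common weight c_i."""
    P_ii = P_row[i]
    diagonal = -np.log(P_ii) - c_i * (1.0 - P_ii)
    off_diagonal = c_i * (P_row.sum() - P_ii)    # c_i * sum_{j != i} P_ij = c_i * (1 - P_ii)
    return diagonal + off_diagonal               # collapses to -log(P_ii)

P_row = np.array([0.2, 0.5, 0.3])                # a row of P, summing to 1
print(row_total(P_row, 0, c_i=3.0))              # 1.609... == -log(0.2), independent of c_i
```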