Solved – Multi-class logarithmic loss function per class

classification, logarithm, loss-functions, machine-learning, multi-class

In a multi-class classification problem, we define the logarithmic loss function $F$ in terms of the logarithmic loss function per label $F_j$ as:

$$ F = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\ln(p_{ij}) = \sum_{j=1}^{M}\left(-\frac{1}{N}\sum_{i=1}^{N} y_{ij}\,\ln(p_{ij})\right) = \sum_{j=1}^{M} F_j $$

where $N$ is the number of instances, $M$ is the number of different labels, $y_{ij}$ is the binary indicator of whether instance $i$ has label $j$, and $p_{ij}$ is the classification probability output by the classifier for instance $i$ and label $j$.
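For concreteness, here is a minimal NumPy sketch of this decomposition; the arrays `y` and `p` below are illustrative and not taken from any particular dataset:

```python
import numpy as np

# N = 4 instances, M = 3 labels; each row of p sums to 1
y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)   # one-hot true labels y_ij
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])          # predicted probabilities p_ij

N, M = y.shape

# Per-label terms: F_j = -(1/N) * sum_i y_ij * ln(p_ij)
F_j = -(y * np.log(p)).sum(axis=0) / N

# The total loss is the sum of the per-label terms
F = F_j.sum()

print("F_j per label:", F_j)
print("F total      :", F)
```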

The cost function $F$ measures the distance between two probability distributions, i.e. how similar the distribution of actual labels is to the probabilities output by the classifier. Hence, values close to zero are preferred.

However, does the cost function per label $F_j$ have any meaning? It seems to measure how well our classifier is doing per label, but it is affected by the number of instances $N$ that do not contain this label.

Best Answer

As you rightly pointed out, a perfect classifier (one that assigns probability 1 to the correct class) has a log loss of 0, which is the preferred case.

Consider a classifier that assigns labels completely at random. The probability of assigning the correct class is $1/M$. Therefore, the log loss for each observation is $-\ln(1/M) = \ln(M)$. This is label independent.
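A quick sketch of this baseline (the setup below, with $M = 3$ and randomly drawn true labels, is purely illustrative): a classifier that predicts the uniform probability $1/M$ for every class has a log loss of exactly $\ln(M)$, no matter what the true labels are.

```python
import numpy as np

M, N = 3, 1000
rng = np.random.default_rng(0)

# Uniform predictions: every class gets probability 1/M for every observation
p_uniform = np.full((N, M), 1.0 / M)

# Arbitrary true labels, one-hot encoded
y = np.eye(M)[rng.integers(0, M, size=N)]

loss = -(y * np.log(p_uniform)).sum() / N
print(loss, np.log(M))  # both ~1.0986 for M = 3
```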

Log loss for an individual observation can be compared with this value to check how well the classifier is performing with respect to random classification. However, this may not make much sense. Let us take an example.

Consider a powerful classifier which misclassifies an observation. Assume the observation actually belongs to class 'x' and the predicted probability of belonging to that class is (nearly) 0. The individual log loss, and hence the overall log loss, will then be infinite. This is quite common and mostly ignored: it is a single observation and says nothing about the overall accuracy of the classifier. However, we can deal with it in two ways:

Method 1: The observation could be an outlier. Remove it and run the classification again.

Method 2: Smooth the probability density function for class membership of all observations (not just the current observation), as sketched below.
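As a rough sketch of method 2, one simple smoothing strategy is to clip the predicted probabilities away from 0 before taking logarithms; the `eps` value below is an illustrative choice, and many library implementations of log loss do something similar internally.

```python
import numpy as np

def clipped_log_loss(y, p, eps=1e-15):
    """Log loss with predicted probabilities clipped away from 0 and 1.

    eps is an illustrative smoothing constant; clipping keeps a single
    confidently wrong prediction from driving the loss to infinity.
    """
    p = np.clip(p, eps, 1.0 - eps)
    # Renormalize each row so it still sums to 1 after clipping
    p = p / p.sum(axis=1, keepdims=True)
    return -(y * np.log(p)).sum() / y.shape[0]

# A confidently wrong prediction: the true class 0 gets probability 0
y = np.array([[1.0, 0.0, 0.0]])
p = np.array([[0.0, 0.9, 0.1]])

print(clipped_log_loss(y, p))  # large but finite, instead of inf
```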

Note: If you are concerned with the predicted probability of class membership, and not just the predicted class, I strongly recommend you look at method 2. It is widely studied in text retrieval (language models); it may be relevant to your case.

Addition: $e^{-\text{loss}}$ is the (geometric) average probability assigned to the correct class. This value can be compared to that of random classification, which is $e^{-\ln(M)} = 1/M$.
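A small illustrative check (the arrays below are made up): exponentiating the negative loss recovers the geometric mean of the probabilities assigned to the true classes, which can then be compared with the random baseline $1/M$.

```python
import numpy as np

y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

loss = -(y * np.log(p)).sum() / y.shape[0]

p_correct = (y * p).sum(axis=1)                    # probability given to the true class
geo_mean = p_correct.prod() ** (1 / len(p_correct))

print(np.exp(-loss), geo_mean)  # identical: exp(-loss) is the geometric mean
print(1 / y.shape[1])           # random baseline 1/M for comparison
```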
