Solved – Multi-class logarithmic loss function per class

classification, logarithm, loss-functions, machine-learning, multi-class

In a multi-class classification problem, we define the logarithmic loss function $F$ in terms of the logarithmic loss function per label $F_j$ as:

$$ F = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\ln(p_{ij}) = \sum_{j=1}^{M}\left(-\frac{1}{N}\sum_{i=1}^{N} y_{ij}\,\ln(p_{ij})\right) = \sum_{j=1}^{M} F_j $$

where $N$ is the number of instances, $M$ is the number of different labels, $y_{ij}$ is the binary indicator of whether instance $i$ has label $j$, and $p_{ij}$ is the classification probability output by the classifier for instance $i$ and label $j$.
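For concreteness, here is a minimal NumPy sketch of this decomposition; the arrays `y` and `p` below are illustrative and not taken from any particular dataset:

```python
import numpy as np

# N = 4 instances, M = 3 labels; each row of p sums to 1
y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)   # one-hot true labels y_ij
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])          # predicted probabilities p_ij

N, M = y.shape

# Per-label terms: F_j = -(1/N) * sum_i y_ij * ln(p_ij)
F_j = -(y * np.log(p)).sum(axis=0) / N

# The total loss is the sum of the per-label terms
F = F_j.sum()

print("F_j per label:", F_j)
print("F total      :", F)
```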

The cost function $F$ measures the distance between two probability distributions, i.e. how similar the distribution of actual labels is to the probabilities output by the classifier. Hence, values close to zero are preferred.

However, does the cost function per label $F_j$ have any meaning? It seems to measure how well our classifier is doing per label, but it is affected by the number of instances $N$ that do not contain this label.

Best Answer

As you rightly pointed out, a perfect classifier (one that assigns probability 1 to the correct class) has a log loss of 0, which is the preferred case.

Consider a classifier that assigns labels completely at random. The probability of assigning the correct class is $1/M$. Therefore, the log loss for each observation is $-\ln(1/M) = \ln(M)$. This is label independent.
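A quick sketch of this baseline (the setup below, with $M = 3$ and randomly drawn true labels, is purely illustrative): a classifier that predicts the uniform probability $1/M$ for every class has a log loss of exactly $\ln(M)$, no matter what the true labels are.

```python
import numpy as np

M, N = 3, 1000
rng = np.random.default_rng(0)

# Uniform predictions: every class gets probability 1/M for every observation
p_uniform = np.full((N, M), 1.0 / M)

# Arbitrary true labels, one-hot encoded
y = np.eye(M)[rng.integers(0, M, size=N)]

loss = -(y * np.log(p_uniform)).sum() / N
print(loss, np.log(M))  # both ~1.0986 for M = 3
```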

Log loss for an individual observation can be compared with this value to check how well the classifier is performing with respect to random classification. However, this may not make much sense. Let us take an example.

Consider a powerful classifier which misclassifies an observation. Assume the observation actually belongs to class 'x' and the predicted probability of belonging to that class is (nearly) 0. The individual log loss, and hence the overall log loss, will then be infinite. This is quite common and mostly ignored: it is a single observation and says nothing about the overall accuracy of the classifier. However, we can deal with it in two ways:

Method 1: The observation could be an outlier. Remove it and run the classification again.

Method 2: Smooth the probability density function for class membership of all observations (not just the current observation), as sketched below.
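As a rough sketch of method 2, one simple smoothing strategy is to clip the predicted probabilities away from 0 before taking logarithms; the `eps` value below is an illustrative choice, and many library implementations of log loss do something similar internally.

```python
import numpy as np

def clipped_log_loss(y, p, eps=1e-15):
    """Log loss with predicted probabilities clipped away from 0 and 1.

    eps is an illustrative smoothing constant; clipping keeps a single
    confidently wrong prediction from driving the loss to infinity.
    """
    p = np.clip(p, eps, 1.0 - eps)
    # Renormalize each row so it still sums to 1 after clipping
    p = p / p.sum(axis=1, keepdims=True)
    return -(y * np.log(p)).sum() / y.shape[0]

# A confidently wrong prediction: the true class 0 gets probability 0
y = np.array([[1.0, 0.0, 0.0]])
p = np.array([[0.0, 0.9, 0.1]])

print(clipped_log_loss(y, p))  # large but finite, instead of inf
```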

Note: If you are concerned with the predicted probability of class membership, and not just the predicted class, I strongly recommend you look at method 2. It is widely studied in text retrieval (language models); it may be relevant to your case.

Addition: $e^{-\text{loss}}$ is the (geometric) average probability assigned to the correct class. This value can be compared to that of random classification, which is $e^{-\ln(M)} = 1/M$.
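A small illustrative check (the arrays below are made up): exponentiating the negative loss recovers the geometric mean of the probabilities assigned to the true classes, which can then be compared with the random baseline $1/M$.

```python
import numpy as np

y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

loss = -(y * np.log(p)).sum() / y.shape[0]

p_correct = (y * p).sum(axis=1)                    # probability given to the true class
geo_mean = p_correct.prod() ** (1 / len(p_correct))

print(np.exp(-loss), geo_mean)  # identical: exp(-loss) is the geometric mean
print(1 / y.shape[1])           # random baseline 1/M for comparison
```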
