Solved – Multi-label classification problem: choosing the right threshold value for y = 1

classification, error, multilabel

I am tasked with evaluating the performance of an image classifier which is based on a very high-level API. I only have access to the predicted label values for the training and validation data sets.

There is a total of 8 labels and each image can be assigned multiple labels, so, for example, the actual output vector y for an image with labels 1, 4 and 5 would be [1,0,0,1,1,0,0,0], while the predicted output vector y_hat would be [0.7, 0.3, 0.5, 0.6, 0.8, 0.2, 0.1, 0.4].

What would be a reliable way of evaluating the performance/error on the training and validation data sets?

Based on what I have read in the literature, Hamming loss seems to be the most commonly used measure, but I am not sure what threshold a predicted value should reach for the label to be accepted as true (i.e. equal to 1). Should I try different arbitrary values, e.g. 0.5, 0.7, 0.9, and see what kind of Hamming loss I get?
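To make the threshold sweep concrete, here is a minimal sketch in plain Python. The second image, its scores, and the list of candidate thresholds are made up for illustration, and `binarize` and `hamming_loss` are hypothetical helper names, not part of any API mentioned here.

```python
# Toy data: rows are images, columns are the 8 labels.
# The first row is the example from the question; the second is invented.
y_true = [
    [1, 0, 0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
]
y_score = [
    [0.7, 0.3, 0.5, 0.6, 0.8, 0.2, 0.1, 0.4],
    [0.2, 0.9, 0.1, 0.3, 0.6, 0.1, 0.4, 0.2],
]


def binarize(scores, threshold):
    """Turn per-label scores into 0/1 predictions."""
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


def hamming_loss(true, pred):
    """Fraction of (image, label) cells where prediction and truth differ."""
    wrong = sum(t != p for tr, pr in zip(true, pred) for t, p in zip(tr, pr))
    total = sum(len(row) for row in true)
    return wrong / total


# Candidate thresholds are an assumption; any grid could be used instead.
thresholds = [0.3, 0.5, 0.7, 0.9]
losses = {t: hamming_loss(y_true, binarize(y_score, t)) for t in thresholds}
best_threshold = min(losses, key=losses.get)

for t in thresholds:
    print(f"threshold={t:.1f}  hamming_loss={losses[t]:.4f}")
```

A threshold chosen this way is tuned to the data it was selected on, so the loss reported on that same data will tend to look optimistic.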

Any practical advice for dealing with the measurement of error in a multi-label classification problem is welcome.

Best Answer

It is actually extremely important which performance measure you select, and it should be relevant to the domain you are working in. Dembczynski et al. have a great paper showing that the choice of measure determines which classifier one should be using.

In my NetSci-X 2015 presentation I argued that it is important to understand the structure of label co-occurrence in your data: explore it a little, build a graph (according to this paper) and see what communities exist. What is the structure of the label relationships? Only then can you decide whether Hamming loss (which ignores label correlations) is a good measure of generalization quality, or whether Jaccard or 0-1 accuracy would be better. What does it mean for your problem to miss one label out of a set of labels? Is it more important to recall relevant documents, or is high precision more important? Does the proportion of correctly assigned labels among all assigned labels matter? What is the cost of a classification mistake in real life, and which kind of mistake costs the most? The answer will be different for classifying patients with syndromes in medicine than for classifying journalistic texts for publishing.
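To see how much the choice of measure matters, here is a small sketch in plain Python that scores the same predictions with several of the measures mentioned above. The data is a toy example and the function names are my own, chosen for illustration.

```python
# Two samples, 8 labels. The first prediction has one extra label,
# the second prediction is perfect. All values are invented.
y_true = [[1, 0, 0, 1, 1, 0, 0, 0],
          [0, 1, 0, 0, 0, 0, 1, 0]]
y_pred = [[1, 0, 1, 1, 1, 0, 0, 0],
          [0, 1, 0, 0, 0, 0, 1, 0]]


def hamming_loss(true, pred):
    """Share of individual label decisions that are wrong."""
    wrong = sum(a != b for t, p in zip(true, pred) for a, b in zip(t, p))
    return wrong / sum(len(t) for t in true)


def subset_accuracy(true, pred):
    """0-1 accuracy: a sample counts only if every label is right."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def jaccard_samples(true, pred):
    """Mean over samples of |intersection| / |union| of the label sets."""
    scores = []
    for t, p in zip(true, pred):
        inter = sum(a & b for a, b in zip(t, p))
        union = sum(a | b for a, b in zip(t, p))
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


def micro_precision_recall(true, pred):
    """Precision and recall pooled over all (sample, label) cells."""
    tp = sum(a & b for t, p in zip(true, pred) for a, b in zip(t, p))
    predicted = sum(sum(p) for p in pred)
    actual = sum(sum(t) for t in true)
    return tp / predicted, tp / actual


precision, recall = micro_precision_recall(y_true, y_pred)
print("hamming loss   ", hamming_loss(y_true, y_pred))
print("0-1 accuracy   ", subset_accuracy(y_true, y_pred))
print("jaccard        ", jaccard_samples(y_true, y_pred))
print("micro precision", precision)
print("micro recall   ", recall)
```

A single spurious label barely moves the Hamming loss but halves the 0-1 accuracy on this toy set, which is the kind of difference the questions above are meant to surface. If scikit-learn is available, `sklearn.metrics` provides `hamming_loss`, `jaccard_score` and `accuracy_score`, which should cover the same measures.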

Different kinds of label co-occurrence patterns can appear. Here are examples from 3 of the many benchmark data sets you can find on MULAN's page.

The histogram for the emotions data set shows that 80% of labels are much more likely to occur together with other labels than as singletons.

The opposite holds for the scene data set, which has the same number of labels as emotions: most of its labels appear as singletons in the training data.

So labels can be correlated or uncorrelated. But the thing is, they can be both: see the medical data set.
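A quick way to get the same kind of picture for your own data is to count, for each label, how often it occurs alone and how often it occurs with other labels. This is only a sketch on an invented 0/1 label matrix `Y`; it is not the code used to produce the histograms above.

```python
# Rows are samples, columns are labels (toy data, 3 labels).
Y = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]

n_labels = len(Y[0])
alone = [0] * n_labels      # times the label was the only one on a sample
together = [0] * n_labels   # times the label appeared with other labels

for row in Y:
    active = [j for j, v in enumerate(row) if v == 1]
    for j in active:
        if len(active) == 1:
            alone[j] += 1
        else:
            together[j] += 1

# Share of each label's occurrences that are singletons
# (None if the label never occurs at all).
singleton_share = [
    alone[j] / (alone[j] + together[j]) if alone[j] + together[j] else None
    for j in range(n_labels)
]

for j in range(n_labels):
    print(f"label {j}: alone={alone[j]} together={together[j]} "
          f"singleton share={singleton_share[j]}")
```

Labels with a share near 1 behave like those in the scene data set, labels with a share near 0 like those in emotions, and a mix of both resembles the medical case.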

So you can actually have structure in the label space. You can check this paper of mine to see how to explore it using community detection methods. A nice set of methods for this is implemented in scikit-multilearn if you're a Python fan.
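As a rough illustration of the idea, the sketch below builds a weighted label co-occurrence graph from a toy label matrix and then groups labels. Note the substitution: instead of a real community detection method, it drops weak edges and takes connected components, which is a much cruder stand-in. The `min_weight` cut-off and the `label_groups` helper are hypothetical, and none of this reflects the scikit-multilearn API.

```python
from collections import Counter
from itertools import combinations

# Rows are samples, columns are labels (toy data, 4 labels).
Y = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]

# Edge weight = number of samples on which both labels are present.
edges = Counter()
for row in Y:
    active = [j for j, v in enumerate(row) if v == 1]
    edges.update(combinations(active, 2))


def label_groups(edges, n_labels, min_weight):
    """Connected components of the graph after dropping weak edges."""
    neighbours = {j: set() for j in range(n_labels)}
    for (a, b), weight in edges.items():
        if weight >= min_weight:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, groups = set(), []
    for start in range(n_labels):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


groups = label_groups(edges, n_labels=4, min_weight=2)
print(dict(edges))
print(groups)
```

With the weak edge between labels 1 and 2 removed, the labels split into two groups; keeping it (`min_weight=1`) merges everything into one. A modularity-based method works on the same weighted graph but does not need a hand-picked cut-off.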
