Solved – Multi-label classification problem: choosing the right threshold value for y = 1

classificationerrormultilabel

I am tasked with evaluating the performance of an image classifier which is based on a very high-level API. I only have access to the predicted label values for the training and validation data set.

There is a total of 8 labels and each image can be assigned multiple labels, so, for example, the actual output vector y for an image with labels 1, 4 and 5 would be [1,0,0,1,1,0,0,0], while the predicted output vector y_hat would be [0.7, 0.3, 0.5, 0.6, 0.8, 0.2, 0.1, 0.4].

What would be a reliable way of evaluating the performance/error on the training and validation data sets?

Based on what I read in the literature, it seems like hamming loss is most commonly used, but I am not sure what threshold value I should pick for the predicted labels to be accepted as true (i.e. equal to 1). Should I try different arbitrary values, e.g. 0.5, 0.7, 0.9, and see what kind of hamming loss I get?

Any practical advice for dealing with the measurement of error in a multi-label classification problem is welcome.

Best Answer

It is actually extremely important what perfomance measure you select and it should be relevant to the domain you are working on. Dembczynski et. al have a great paper that shows that selecting a measure defines which classifier one should be using.

In my netsci-x 2015 presentation I was arguing that it is important to understand the structure of label co-occurence in your data, explore them a little, build a graph - according to this paper - and see what communities exist. What is the structure of label relationships? Only then you can decide if hamming loss (which ignores label correlations) is a good measure of generalization quality, maybe jaccard or 0-1 accuracy are better. What does it mean for your problem to miss a label out of a set of labels? Is it more importand to recall relevant documents or maybe it's important to have a high precision. Is a proportion of properly assigned labels to all assigned important? What is the cost of making a mistake in your classification in real life? Which cost is the highest, it will be different in classifying patients with syndromes in medicine and in classification journalism texts for publishing.

Different kinds of label co-occurrence patterns can appear. Here are examples for 3 out of many benchmark data sets you can find in MULAN's page.

The histogram for emotions data set shows that 80% of labels are much more likely to occur together with other labels than as singletons.

The oposite is with the scene data set - which has the same number of labels as emotions - most of its labels appear as singletons in training data:

So labels can be correlated and uncorrelated. But the thing is - they can be both - see the medical data set:

So you can actually have a structure in the label space - you can check this paper of mine to see how to explore it using community detection method. A nice set of methods for this is implemented in scikit-multilearn if you're a python fan.

Related Solutions

Solved – scikit multi label classification

The Multi-label algorithm accepts a binary mask over multiple labels. So, for example, you could do something like this:

data = [
        [[0.1 , 0.6, 0.0, 0.3], 1, 10, 0, 0, 0],
        [[0.7 , 0.3, 0.0, 0.0], 0, 7, 22, 0, 0],
        [[0.0 , 0.0, 0.6, 0.4], 0, 0, 6, 0, 20],
        #...
       ]

X = np.array([d[1:] for d in data])
yvalues = np.array([d[0] for d in data])

# Create a binary array marking values as True or False
from sklearn.preprocessing import MultiLabelBinarizer
Y = MultiLabelBinarizer().fit_transform(yvalues)

clf = OneVsRestClassifier(SVC(kernel='poly'))
clf.fit(X, Y)
clf.predict(X)  # predict on a new X

The result for each prediction will be an array of 0s and 1s marking which class labels apply to each row input sample.

Given your data, though, I'm not sure this is what you want to do. For example, the third point has zero listed twice, which makes me think that you're not predicting multiple labels in an unordered OneVsRest manner, but actually predicting multiple ordered columns of labels: in that case, it might make sense to do a separate classification for each, e.g.

X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])
clfs = [SVC().fit(X, Y[:, i]) for i in range(Y.shape[1])]
Ypred = np.array([clf.predict(X) for clf in clfs]).T

With other classifiers, such as RandomForestClassifier, you can do this column-by-column prediction in one operation: e.g.

X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])
RandomForestClassifier().fit(X, Y).predict(X)

Of course, the array passed to predict should be on something different than the array passed to fit, but hopefully this makes the distinction clear.

Best Answer

Related Solutions

Solved – scikit multi label classification

Related Question