Scikit-learn's multi-label estimators accept a binary mask over multiple labels. So, for example, you could do something like this:
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Each row holds a list of labels followed by the feature values
data = [
    [[0.1, 0.6, 0.0, 0.3], 1, 10, 0, 0, 0],
    [[0.7, 0.3, 0.0, 0.0], 0, 7, 22, 0, 0],
    [[0.0, 0.0, 0.6, 0.4], 0, 0, 6, 0, 20],
    # ...
]
X = np.array([d[1:] for d in data])
yvalues = np.array([d[0] for d in data])

# Create a binary indicator array marking which labels apply to each sample
Y = MultiLabelBinarizer().fit_transform(yvalues)

clf = OneVsRestClassifier(SVC(kernel='poly'))
clf.fit(X, Y)
clf.predict(X)  # in practice, predict on new data rather than the training X
The result of each prediction will be an array of 0s and 1s marking which class labels apply to each input sample.
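To make that binary mask concrete, here is what MultiLabelBinarizer produces for the label lists above; note how the duplicated 0.0 in the third row collapses to a single column, since the binarizer treats each list as an unordered set of labels:
from sklearn.preprocessing import MultiLabelBinarizer

labels = [[0.1, 0.6, 0.0, 0.3], [0.7, 0.3, 0.0, 0.0], [0.0, 0.0, 0.6, 0.4]]
print(MultiLabelBinarizer().fit_transform(labels))
# [[1 1 1 0 1 0]    columns correspond to the sorted label values:
#  [1 0 1 0 0 1]    0.0, 0.1, 0.3, 0.4, 0.6, 0.7
#  [1 0 0 1 1 0]]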
Given your data, though, I'm not sure this is what you want to do. For example, the third point has zero listed twice in its list of labels, which makes me think that you're not predicting multiple labels in an unordered OneVsRest manner, but actually predicting multiple ordered columns of labels. In that case, it might make sense to do a separate classification for each column, e.g.
X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])
# Fit one classifier per column of labels (the floats here stand in for
# discrete class labels; scikit-learn classifiers expect discrete targets)
clfs = [SVC().fit(X, Y[:, i]) for i in range(Y.shape[1])]
# Stack the per-column predictions back into one (n_samples, n_columns) array
Ypred = np.array([clf.predict(X) for clf in clfs]).T
With other classifiers, such as RandomForestClassifier, you can do this column-by-column prediction in one operation, e.g.
from sklearn.ensemble import RandomForestClassifier

X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])
# Random forests natively support multi-output targets (one column per label)
RandomForestClassifier().fit(X, Y).predict(X)
Of course, the array passed to predict should be different from the array passed to fit, but hopefully this makes the distinction clear.
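For example (a minimal sketch; the split fraction and random seed are arbitrary choices, not from your question):
from sklearn.model_selection import train_test_split

# Hold out a test set so that predict sees samples the model was not fitted on
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
Ypred = RandomForestClassifier().fit(X_train, Y_train).predict(X_test)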
This is not the only way, and it may not work for all problems, but one solution would be to compare the performance of a range of class numbers (the current number and one more, or the current number and one either side, or two either side; how many class counts you explore at each update depends on how much computational effort you can spare) and use an information criterion, e.g. the corrected Akaike Information Criterion (AICc), to assess the goodness of fit of each alternative. The model with the lowest AICc is the 'best' fit, although trivial differences (a delta-AICc smaller than about 5-10) are not sufficient to conclude that either model is substantially better. You could go one step further and calculate relative likelihoods for the different alternatives using Akaike weights.
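As a minimal sketch of those calculations (the function and variable names here are illustrative, not from a particular library; log_lik is a candidate model's maximised log-likelihood, k its number of parameters, and n the sample size):
import numpy as np

def aicc(log_lik, k, n):
    # AIC = 2k - 2 ln(L), plus the small-sample correction term
    return 2 * k - 2 * log_lik + (2 * k * (k + 1)) / (n - k - 1)

def akaike_weights(aicc_values):
    # Relative likelihood exp(-delta/2) of each model, normalised to sum to 1
    deltas = np.asarray(aicc_values) - np.min(aicc_values)
    rel_lik = np.exp(-deltas / 2)
    return rel_lik / rel_lik.sum()

# e.g. scores = [aicc(ll, k, n) for ll, k in zip(log_liks, param_counts)]
#      weights = akaike_weights(scores)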
I'd recommend taking a look at Burnham and Anderson (2002) "Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach".
Best Answer
Definitions.
In a classification task, your goal is to learn a mapping $h: X\rightarrow Y$ (with your favourite ML algorithm, e.g. CNNs). We make two common distinctions: binary vs. multi-class (are there two possible classes, or more than two?) and single-label vs. multi-label (is each sample assigned exactly one class, or possibly several?).
Effect on network architecture. The first distinction determines the number of output units (i.e., the number of neurons in the final layer). The second distinction determines which activation function for the final layer, and which loss function, you should use. For single-label, the standard choice is softmax with categorical cross-entropy; for multi-label, switch to sigmoid activations with binary cross-entropy. See here for a more detailed discussion of this question.
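As a concrete sketch of those two pairings (using PyTorch purely for illustration; the feature width, batch size, and number of labels below are made-up values):
import torch
import torch.nn as nn

k = 4                                  # hypothetical number of classes/labels
head = nn.Linear(128, k)               # final layer: k output units either way
logits = head(torch.randn(8, 128))     # a made-up batch of 8 feature vectors

# Single-label: softmax + categorical cross-entropy
# (CrossEntropyLoss applies log-softmax internally, so it takes raw logits)
y_single = torch.randint(0, k, (8,))   # exactly one class index per sample
loss_single = nn.CrossEntropyLoss()(logits, y_single)

# Multi-label: sigmoid + binary cross-entropy, one independent decision per label
# (BCEWithLogitsLoss applies the sigmoid internally)
y_multi = torch.randint(0, 2, (8, k)).float()  # 0/1 indicator per label
loss_multi = nn.BCEWithLogitsLoss()(logits, y_multi)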
Creating "hybrid" combinations. I'll describe an example similar to the one in your question. Suppose I'm trying to classify animals, and I'm interested in recognizing the following:
This looks confusing: some of the labels are mutually exclusive (an animal can't be both black and orange) and others aren't (it can be a black dog). In this case, the solution is to perform multi-class classification with $k=3\cdot 3=9$ outputs (or generally, the sum of the category sizes; here all three categories have the same size, 3). You just have to define the loss function carefully: apply a softmax activation to each group of 3 outputs (one group per category) and compare it to the true label for that category. I created a little sketch which I think makes it clear:
So the final loss is $L(\hat y, y)=CE_{species}+CE_{color}+CE_{size}$, one cross-entropy term per category. The entire idea here is that we exploited information about the structure of the labels (which are mutually exclusive and which aren't) to significantly reduce the number of outputs: from an exponential number (all combinations, in this case $3^3=27$) to a multiplicative one ($3\cdot 3=9$).
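Here is a minimal sketch of that grouped loss (again PyTorch for illustration; the layer sizes and batch are made up, and the three groups stand for species, color, and size):
import torch
import torch.nn as nn

# 3 categories with 3 labels each -> 9 output units in the final layer
head = nn.Linear(128, 9)
logits = head(torch.randn(8, 128)).view(8, 3, 3)  # (batch, category, label)

targets = torch.randint(0, 3, (8, 3))  # one true class index per category

# Softmax + cross-entropy applied within each category, then summed:
# L = CE_species + CE_color + CE_size
ce = nn.CrossEntropyLoss()
loss = sum(ce(logits[:, g, :], targets[:, g]) for g in range(3))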