Classification – Multi-Label vs. Multi-Class Classification: Differences and Uses

classification, deep learning, multi-class, multilabel

I'm having a hard time getting the difference between multi-class and multi-label classification with CNNs.

My understanding is that if I want to

  • classify different breeds of dogs, that is a multi-label classification as I have the same class of images and different labels

  • classify dogs and cats, that is a multi-label classification as I have different images to recognise

Now, what if my images contain a black/white dog/cat? In this case there are two classes (animal, colour) and two labels for each class. How do I build a classifier for this? I'd like to have as output a prediction for both classes (as I am assuming that any image can be labelled in those classes).

I was thinking of training two classifiers, one for each class, but in this way I'd lose the link between classes (which may be important). The second idea is to use a single class containing all the possible combinations of the labels. However, this solution doesn't fully convince me, as the number of such combined labels grows quickly as I add more individual labels (e.g., more animals and colours).

Are my intuitions correct? What is the proper way to tackle this kind of problem? Thanks

Best Answer

Definitions.

In a classification task, your goal is to learn a mapping $h: X\rightarrow Y$ (with your favourite ML algorithm, e.g., CNNs). We make two common distinctions:

  • Binary vs multiclass: In binary classification, $\left|Y\right|=2$ (e.g., a positive and a negative category). In multiclass classification, $\left|Y\right|=k$ for some $k\in\mathbb{N}$. In other words, this is just a matter of how many possible answers there are.
  • Single-label vs multilabel: This refers to how many labels a single example $x\in X$ can take at once, i.e., whether your chosen categories are mutually exclusive or not. For example, if you are trying to predict the color of an object, then you are probably doing single-label classification: a red object cannot be a black object at the same time. On the other hand, if you are doing object detection in an image, then since one image can contain multiple objects, you are doing multi-label classification (see the small encoding sketch after this list).

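To make the second distinction concrete, here is a tiny illustrative sketch of how the targets themselves differ (the vectors and the 4-class setup are my own example, not from the question):

```python
# Illustrative target encodings for a hypothetical 4-class problem.

# Single-label (mutually exclusive classes): exactly one 1 per example,
# e.g. the colour of an object is "red" and nothing else.
y_single = [0, 0, 1, 0]   # one-hot vector

# Multi-label (classes can co-occur): any number of 1s per example,
# e.g. an image that contains both a dog and a cat.
y_multi = [1, 1, 0, 0]    # multi-hot vector
```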
Effect on network architecture. The first distinction determines the number of output units (i.e., the number of neurons in the final layer). The second distinction determines which activation function to use in the final layer and which loss function to pair with it. For single-label classification, the standard choice is softmax with categorical cross-entropy; for multi-label classification, switch to per-unit sigmoid activations with binary cross-entropy. See here for a more detailed discussion of this question.
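As a minimal sketch of these two pairings (PyTorch is my choice of framework here; `n_classes`, `logits`, and the random targets are illustrative placeholders, not part of the original answer):

```python
import torch
import torch.nn.functional as F

n_classes = 5
logits = torch.randn(8, n_classes)            # raw outputs of the final linear layer

# Single-label: softmax + categorical cross-entropy.
# F.cross_entropy applies log-softmax internally, so we pass raw logits
# and one integer class index per example as the target.
y_single = torch.randint(0, n_classes, (8,))
loss_single = F.cross_entropy(logits, y_single)

# Multi-label: per-unit sigmoid + binary cross-entropy.
# binary_cross_entropy_with_logits applies the sigmoid internally;
# targets are multi-hot vectors (any number of 1s per example).
y_multi = torch.randint(0, 2, (8, n_classes)).float()
loss_multi = F.binary_cross_entropy_with_logits(logits, y_multi)
```

With the multi-label pairing, each output unit is an independent yes/no decision, which is exactly what allows several labels to be active at once.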

Creating "hybrid" combinations. I'll describe an example similar to the one in your question. Suppose I'm trying to classify animals, and I'm interested in recognizing the following:

  • color (black, white, orange)
  • size (small, medium, large)
  • type (cat, dog, chimpanzee)

This looks confusing: some of the labels are mutually exclusive (an animal can't be both black and orange) and others aren't (it can be a black dog). In this case, the solution is to use a single output layer with $3+3+3=9$ units (in general, the sum of the number of labels in each category; here all three categories happen to have 3 labels each). You just have to define the loss function carefully: apply a softmax activation over each group of 3 units (one group per category) and compare it to the true label for that category. I created a little sketch which I think makes it clear:

[sketch: the output layer split into per-category softmax groups, each compared to its own true label]

So the final loss is $L(\hat y, y)=CE_{color} + CE_{size} + CE_{type}$. The entire idea here is that we exploited information about the structure of the labels (which are mutually exclusive and which aren't) to significantly reduce the number of outputs: from an exponential number (all combinations, in this case $3^3=27$ combined classes) down to an additive number ($3+3+3=9$ output units).
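Here is a rough sketch of that grouped loss (again in PyTorch, assuming a hypothetical 128-dimensional feature vector `backbone_features` coming out of the CNN; all names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Three categories, three labels each: colour, size, type.
group_sizes = [3, 3, 3]                    # 3 + 3 + 3 = 9 output units

backbone_features = torch.randn(8, 128)    # stand-in for the CNN features
head = nn.Linear(128, sum(group_sizes))    # single output layer with 9 units
logits = head(backbone_features)

# One integer label per category for each example (shape: batch x 3).
targets = torch.randint(0, 3, (8, len(group_sizes)))

# Softmax + cross-entropy applied separately to each group of 3 units,
# then summed: L = CE_colour + CE_size + CE_type.
loss = 0.0
start = 0
for i, size in enumerate(group_sizes):
    group_logits = logits[:, start:start + size]
    loss = loss + F.cross_entropy(group_logits, targets[:, i])
    start += size
```

Because each group gets its own softmax, the labels within a category stay mutually exclusive, while all three categories are still predicted jointly by one network.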
