Solved – Should I use multi-label classification?

Tags: classification, data mining, dataset, machine learning, multilabel

I have a classification problem with 2 classes (positive and negative). Usually, in such classification problems, all the samples will be labelled either 'positive' or 'negative'. In my dataset, some of the samples possess a combination of both positive and negative characteristics. Formally, if the dataset is $x$, then,

$x=x_1 \cup x_2 \cup x_3$

where $x_1$ is the set of all positive samples, $x_2$ is the set of all negative samples and $x_3$ is the set of samples that contain the characteristics of both the classes.
As far as I can tell, this situation can be handled in 2 ways:

  1. Ignore $x_3$ (samples that contain characteristics of both classes) and treat the problem as a traditional binary classification.
  2. Label the samples in $x_3$ with both labels (positive & negative) and consider this as a multi-label classification problem.

I wish to follow the second option, as it is the more natural choice. Ignoring some samples from the dataset feels like manipulating the dataset artificially, which may hurt the classifier's performance in a real-world scenario. In this context, I have the following questions:

  1. Is it correct to treat this as a multi-label classification problem? If so, is the intuition explained in the previous paragraph correct?
  2. Is there any other learning paradigm that can handle this scenario? If so, please provide references to the relevant literature.

Best Answer

The problem that you have described is not quite a multi-label classification problem. Multi-label classification allows an object to have any combination of labels, including no labels at all. So in your case, where there are 2 labels, it would allow 4 possible outcomes: positive only, negative only, both, and neither. That being said, you may be able to adapt a multi-label classifier to exclude the "no label" case.
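To make the counting concrete, here is a minimal sketch in plain Python that enumerates the label combinations for the two labels and then drops the "no label" case. The names `LABELS`, `label_combinations`, and `describe` are made up for illustration and are not from any library.

```python
from itertools import product

# The two labels from the question.
LABELS = ("positive", "negative")


def label_combinations(labels):
    """Return every 0/1 indicator vector over `labels` (all label subsets)."""
    return list(product((0, 1), repeat=len(labels)))


def describe(indicator):
    """Human-readable name for an indicator vector."""
    names = [name for name, bit in zip(LABELS, indicator) if bit]
    return " & ".join(names) if names else "no label"


# Unrestricted multi-label output space: 2 labels give 2**2 = 4 outcomes.
all_outcomes = label_combinations(LABELS)

# Excluding the all-zero vector leaves the three cases in the question:
# x_1 (positive only), x_2 (negative only), x_3 (both).
valid_outcomes = [v for v in all_outcomes if any(v)]

for v in valid_outcomes:
    print(v, "->", describe(v))
```

A multi-label model trained on such indicator vectors can still predict `(0, 0)` at test time, so the exclusion would have to be enforced separately, for example by falling back to the label with the highest score.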

What you describe sounds more like an ordinal regression problem. That is, there are 3 labels, where one label is logically "in between" the other two. See the link for details.
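As a rough illustration of the ordinal view, the sketch below encodes three ordered classes with cumulative (threshold) targets, which is one common way to reduce an ordinal problem to a set of binary ones. It is not necessarily the method the answer had in mind. The class name "mixed" for $x_3$, the ordering negative < mixed < positive, and the helper names are all assumptions made for this example.

```python
# Assumed ordering; "mixed" is a hypothetical name for the x_3 samples.
ORDER = ("negative", "mixed", "positive")


def to_rank(label):
    """Map a class name to its position in the assumed ordering."""
    return ORDER.index(label)


def cumulative_targets(label):
    """Encode rank r as K-1 binary targets; target k answers 'is rank > k?'."""
    rank = to_rank(label)
    return [int(rank > k) for k in range(len(ORDER) - 1)]


def decode(prob_greater):
    """Predict a class from estimated P(rank > k) by counting thresholds passed."""
    rank = sum(p > 0.5 for p in prob_greater)
    return ORDER[rank]


for name in ORDER:
    print(name, "->", cumulative_targets(name))
```

Each binary target can then be fit with any ordinary binary classifier. Unlike the unrestricted multi-label encoding, this scheme has exactly three valid target vectors, so there is no "no label" outcome to exclude.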
