I have a classification problem with 2 classes (positive and negative). Usually, in such classification problems, all the samples will be labelled either 'positive' or 'negative'. In my dataset, some of the samples possess a combination of both positive and negative characteristics. Formally, if the dataset is $x$, then,
$x=x_1 \cup x_2 \cup x_3$
where $x_1$ is the set of all positive samples, $x_2$ is the set of all negative samples and $x_3$ is the set of samples that contain the characteristics of both the classes.
As far as I can see, this situation could be handled in two ways:
- Ignore $x_3$ (samples that contain characteristics of both classes) and treat the problem as a traditional binary classification.
- Label the samples in $x_3$ with both labels (positive & negative) and consider this as a multi-label classification problem.
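For concreteness, the second option amounts to encoding each sample's label set as a binary indicator vector. A minimal sketch using scikit-learn (the label names are illustrative):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# One sample each from x1, x2 and x3, given as label *sets*
labels = [{"positive"}, {"negative"}, {"positive", "negative"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

print(list(mlb.classes_))  # ['negative', 'positive'] (sorted alphabetically)
print(Y.tolist())          # [[0, 1], [1, 0], [1, 1]]
```

Each row of `Y` can then be fed to any classifier that accepts a multi-label indicator matrix.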
I wish to follow the second option, as it seems the more natural choice: discarding samples from the dataset feels like artificially manipulating the data, which may hurt the classifier's performance in a real-world scenario. In this context, I have the following questions:
- Is it correct to treat this as a multi-label classification problem? If so, is the intuition explained in the previous paragraph correct?
- Is there any other learning paradigm that can handle this scenario? If so, please provide references to the relevant literature.
Best Answer
The problem that you have described is not quite a multi-label classification problem. Multi-label classification allows an object to have any combination of labels, including no labels at all. So in your case, where there are 2 labels, it would allow 4 possible outcomes. That being said, you may be able to adapt a multi-label classifier to exclude the "no label" case.
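To make the "4 possible outcomes" point concrete, here is a toy sketch (data and class separation are made up) of a multi-label setup using scikit-learn's one-vs-rest wrapper, which fits one independent binary classifier per label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
# Toy features: x1 (positive) near +1, x2 (negative) near -1, x3 (mixed) near 0
X = np.vstack([
    rng.randn(40, 2) + 1.0,   # x1: purely positive samples
    rng.randn(40, 2) - 1.0,   # x2: purely negative samples
    rng.randn(20, 2),         # x3: mixed samples
])
# Indicator matrix: column 0 = "positive", column 1 = "negative"
Y = np.zeros((100, 2), dtype=int)
Y[:40, 0] = 1       # x1 carries only the positive label
Y[40:80, 1] = 1     # x2 carries only the negative label
Y[80:, :] = 1       # x3 carries both labels

clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(X)   # each row is a {0,1} vector over the two labels;
                        # note that [0, 0] ("no label") is also a possible
                        # output, which is the outcome you would want to exclude
```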
What you describe sounds more like an ordinal regression problem. That is, there are 3 labels, where one label is logically "in between" the other two. See the link for details.
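One simple way to fit an ordinal model with off-the-shelf tools is the threshold decomposition of Frank and Hall: train one binary classifier per ordinal threshold and combine their probabilities. A hedged sketch with made-up data, encoding negative = 0, mixed = 1, positive = 2:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)
# Three well-separated toy classes along one feature
X = rng.randn(150, 1) * 0.3 + np.repeat([0.0, 1.0, 2.0], 50)[:, None]
y = np.repeat([0, 1, 2], 50)   # 0 = negative, 1 = mixed, 2 = positive

# One binary model per threshold of the ordinal scale
clf_gt0 = LogisticRegression().fit(X, (y > 0).astype(int))  # models P(y > 0)
clf_gt1 = LogisticRegression().fit(X, (y > 1).astype(int))  # models P(y > 1)

def predict_ordinal(X):
    p_gt0 = clf_gt0.predict_proba(X)[:, 1]
    p_gt1 = clf_gt1.predict_proba(X)[:, 1]
    # P(y=0) = 1 - P(y>0); P(y=1) = P(y>0) - P(y>1); P(y=2) = P(y>1)
    probs = np.column_stack([1 - p_gt0, p_gt0 - p_gt1, p_gt1])
    return probs.argmax(axis=1)

pred = predict_ordinal(X)
```

The decomposition keeps the ordering constraint implicit: a sample can only be "positive" by first passing the "more than negative" threshold, which matches the intuition that the mixed class sits between the other two.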