I have run into a problem using novelty and outlier detection with my multi-label data. I have some training data that is not polluted by outliers. However, the training data carries multiple labels: say it consists of 150 cases, with 50 labelled class A, 50 labelled class B, and the rest labelled class C.
My testing data contains outliers, as well as good data that can be labelled A, B or C. Is it possible to use something like a 'One Class SVM' to distinguish the outliers from the good data (a mix of labels A, B and C)?
Or can it only handle one class at a time, e.g., distinguishing good data labelled A from outliers?
Thanks. A.
Best Answer
Let's first clarify the terminology a little. The data you have, according to the description, is multi-class data rather than multi-label data. In both cases, the number of possible classes/labels in the data set is equal to or greater than 2, i.e., $\lvert Y \rvert \ge 2$. The difference is that in a multi-class data set, one instance $\mathbf{x}$ is associated with one and only one label, i.e., $\lvert \mathbf{y}_\mathbf{x} \rvert = 1$, while in the multi-label case, one instance $\mathbf{x}$ is associated with one or more labels, i.e., $\lvert \mathbf{y}_\mathbf{x} \rvert \ge 1$. As you can see, the multi-class setting is a special case of the multi-label setting.
In the conventional setting of outlier detection, the instances are not labeled. In that case, unsupervised techniques such as one-class SVM or density estimation (e.g., mixture of Gaussians) are often applied.
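To make the unsupervised setting concrete, here is a minimal sketch using scikit-learn's `OneClassSVM`: a one-class SVM is fit on clean training data with the labels ignored, then used to flag test instances as outliers. The data below is synthetic and purely illustrative; the `nu` value is a tuning choice, not something implied by the question.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# 150 clean training instances (class labels are ignored in this setting)
X_train = rng.normal(loc=0.0, scale=1.0, size=(150, 2))

# Test data: 20 "good" points from the same distribution,
# plus 5 obvious outliers far from the training mass.
X_good = rng.normal(loc=0.0, scale=1.0, size=(20, 2))
X_outliers = rng.uniform(low=8.0, high=10.0, size=(5, 2))
X_test = np.vstack([X_good, X_outliers])

# nu upper-bounds the fraction of training points treated as outliers.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
clf.fit(X_train)

# predict() returns +1 for inliers and -1 for outliers.
pred = clf.predict(X_test)
```

Points far from the training mass receive a strongly negative decision-function value and are labelled `-1`, while most points drawn from the training distribution are labelled `+1`.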
As you have labels at hand, it would be silly not to use them. There are at least three options I can think of.
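As a hypothetical illustration of how labels could be exploited (an assumption on my part, not necessarily one of the options listed in the answer): fit one one-class SVM per class and call a test point an outlier only if every per-class model rejects it. The class centres below are invented for the sketch.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(1)

# Three classes (A, B, C), 50 clean training instances each,
# centred at different (invented) locations, mirroring the 150-case setup.
centers = {"A": (0.0, 0.0), "B": (5.0, 0.0), "C": (0.0, 5.0)}
models = {}
for label, center in centers.items():
    X_c = rng.normal(loc=center, scale=1.0, size=(50, 2))
    models[label] = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_c)

def is_outlier(x):
    """Outlier iff no class-specific model accepts the point."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    return all(m.predict(x)[0] == -1 for m in models.values())

near_a = is_outlier([0.2, -0.1])    # close to class A's centre
far_away = is_outlier([20.0, 20.0])  # far from every class
```

A point near any class centre is accepted by that class's model, while a point far from all three training clouds is rejected by every model and flagged as an outlier.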