Solved – Novelty and Outlier Detection for Multi-label Data

classification, machine-learning, novelty-detection, outliers, svm

I have a problem using novelty and outlier detection with my multi-label data. For example, I have some training data that is not polluted by outliers. However, the training data carry multiple labels: say the data consist of 150 cases, with 50 labelled class A, 50 labelled class B, and the rest labelled class C.

My testing data contain outliers, as well as good data that can be labelled A, B or C. Is it possible to use something like a one-class SVM to distinguish the outliers from the good data (a mix of labels A, B and C)?

Or can it only deal with one class at a time, e.g., distinguishing good data labelled A from outliers?

Thanks. A.

Best Answer

Let's first clarify the terminology a little. The data you have, according to the description, are multi-class data rather than multi-label data. In both cases, the number of possible classes/labels in the data set is equal to or larger than 2, i.e., $\lvert Y \rvert \ge 2$. The difference is that in a multi-class data set, one instance $\mathbf{x}$ is associated with one and only one label, i.e., $\lvert \mathbf{y}_\mathbf{x} \rvert = 1$, while in the multi-label case, one instance $\mathbf{x}$ is associated with one or more labels, i.e., $\lvert \mathbf{y}_\mathbf{x} \rvert \ge 1$. As you can see, the multi-class setting is a special case of the multi-label setting.

In the conventional setting of outlier detection, the instances are not labeled. In that case, unsupervised techniques such as one-class SVM or density estimation (e.g., mixture of Gaussians) are often applied.
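As a minimal sketch of that conventional, label-free setting (assuming scikit-learn and synthetic Gaussian data, which are illustrative stand-ins and not from the question), a one-class SVM fitted on clean training data will flag test points far from the training distribution:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# 150 clean training instances; the one-class model ignores any class labels
X_train = rng.normal(loc=0.0, scale=1.0, size=(150, 2))

# Test set: 10 inliers from the same distribution plus 5 far-away outliers
X_test = np.vstack([
    rng.normal(0.0, 1.0, size=(10, 2)),
    rng.uniform(6.0, 8.0, size=(5, 2)),
])

# nu bounds the fraction of training points treated as margin errors/outliers
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
pred = ocsvm.predict(X_test)  # +1 = inlier, -1 = outlier
```

Here `pred` marks the five distant points as outliers regardless of which of the classes A, B or C the clean points came from, which is exactly the behaviour the question asks about.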

As you have labels at hand, it would be silly not to use them. There are at least three options I can think of:

  • Define the outlier score separately on the feature space $\mathcal{X}$ and the label space $\mathcal{Y}$. On $\mathcal{X}$, conventional unsupervised techniques can be used. On $\mathcal{Y}$, similarity measures for sets or binary vectors can be used. In the end, you obtain two outlier scores $S_\mathcal{X}$ and $S_\mathcal{Y}$, and the final outlier score can be defined as $S = S_\mathcal{X} + \lambda S_\mathcal{Y}$, where $\lambda$ is a trade-off parameter that balances the two values and should be tuned.
  • Train a multi-class classifier on the training data. When a test instance arrives, apply the classifier on it and if the predicted label is different from the test label, this instance is likely an outlier. (If the data set is multi-label instead of multi-class, we need to have a similarity measure on the label space to quantify if the predicted labels are significantly different from the given labels.)
  • Simply concatenate the feature vector and the label/class vector and apply the standard unsupervised techniques on the combined vectors.
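The second option above can be sketched as follows (a hypothetical example assuming scikit-learn, a random-forest classifier, and synthetic well-separated clusters; none of these specifics come from the answer itself):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical training set: three well-separated classes A, B, C
centers = {"A": (0.0, 0.0), "B": (5.0, 0.0), "C": (0.0, 5.0)}
X_train = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in centers.values()])
y_train = np.repeat(list(centers), 50)  # ['A']*50 + ['B']*50 + ['C']*50

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Test instances with their claimed labels; the last point lies in class B's
# region but claims label A, so the disagreement flags it as a likely outlier
X_test = np.array([[0.1, -0.2], [5.2, 0.1], [4.9, 0.3]])
y_test = np.array(["A", "B", "A"])
suspect = clf.predict(X_test) != y_test
```

The boolean mask `suspect` is True exactly where the classifier's prediction contradicts the given label, which is the disagreement criterion the second option describes.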