Solved – Automatic labeling of training set

classification, information retrieval, machine learning, text mining, unsupervised learning

I once came across the following question: given a training set, is it possible to label it automatically? In addition, if the training set consists of plain text files, is it possible to find the best description of the positive class and the best description of the negative class?

My feeling is that automatic labelling can be treated as unsupervised learning, and that finding the best description from a set of text documents can be treated as an information extraction problem. But I would like to know: is there a formalized methodology or a paper that discusses this problem? What is the state of the art for this issue?

Best Answer

The problem you refer to is semi-supervised learning, of which active learning is a particular case.

Here one has a few labelled samples and a large number of unlabelled samples. The goal is to exploit the information in the unlabelled samples, together with the few labels, to build a good classifier.

Let us take the case of classification. The basic idea is that one would like to find the label that maximizes $p(C|x)$, where $C$ is some label representing class membership and $x$ is a sample. Now, from the unlabelled data, which are samples drawn from $p(x)$, one attempts to infer something about $p(C|x)$. The way to attack the problem is to introduce some assumptions linking $p(C|x)$ to the properties of $p(x)$. Those are the assumptions listed in the Wikipedia article.
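To make the link between $p(C|x)$ and $p(x)$ more concrete, here is a minimal sketch of self-training, one common semi-supervised scheme (not necessarily the one the linked material has in mind). It assumes that points lying close to a class's centre share its label, which is a form of the cluster assumption. The nearest-centroid classifier, the function names, the toy 2-D points, and the `min_margin` threshold are all illustrative choices of mine, not part of the original answer.

```python
import math

def fit_centroids(labelled):
    """Nearest-centroid 'classifier': one mean point per class."""
    groups = {}
    for x, c in labelled:
        groups.setdefault(c, []).append(x)
    return {c: tuple(sum(col) / len(xs) for col in zip(*xs))
            for c, xs in groups.items()}

def predict(centroids, x):
    """Return (label, margin): the nearest class and how much closer it is
    than the runner-up. A large margin stands in for high confidence."""
    ranked = sorted((math.dist(x, m), c) for c, m in centroids.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin

def self_train(labelled, unlabelled, min_margin=1.0, max_rounds=10):
    """Repeatedly pseudo-label the confident unlabelled points and refit."""
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(max_rounds):
        centroids = fit_centroids(labelled)
        scored = [(x, *predict(centroids, x)) for x in pool]
        confident = [(x, c) for x, c, m in scored if m >= min_margin]
        if not confident:
            break  # nothing left that the model is sure about
        labelled += confident
        pool = [x for x, c, m in scored if m < min_margin]
    return fit_centroids(labelled), pool

# Toy data (hypothetical): one labelled point per class, five unlabelled.
labelled = [((0.0, 0.0), "neg"), ((5.0, 5.0), "pos")]
unlabelled = [(0.5, 0.2), (1.0, 0.8), (4.5, 5.2), (4.0, 4.1), (2.5, 2.5)]

centroids, leftover = self_train(labelled, unlabelled)
print(centroids)  # class centres after absorbing the pseudo-labelled points
print(leftover)   # points that were never labelled confidently
```

In this toy run the four points near the two corners get pseudo-labels, while the point midway between the classes stays unlabelled because neither centroid is clearly closer. That is the intended behaviour: the unlabelled data only help where the structure of $p(x)$ supports a confident guess about $p(C|x)$.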

You may also find these slides very informative.