Solved – combining supervised and unsupervised learning

classification, semi-supervised-learning

I am trying to classify short natural language documents, for which I have a small labeled dataset. Using out-of-the-box document classifiers and a basic tf-idf representation, I am able to get "reasonable" performance. But the data is so sparse that the classifier is doing little more than keyword matching.
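
For concreteness, here is a minimal sketch of that kind of baseline, assuming scikit-learn; the documents and labels are placeholders rather than real data:

```python
# A minimal sketch of the tf-idf baseline described above, assuming scikit-learn.
# The documents and labels here are placeholders, not real data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_texts = ["short document one", "another short document"]  # small labeled set
labels = [0, 1]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # sparse tf-idf features
    LogisticRegression(max_iter=1000),     # out-of-the-box linear classifier
)
baseline.fit(labeled_texts, labels)
print(baseline.predict(["a new short document"]))
```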

However, I also have a reasonably large corpus of unlabeled documents with high domain overlap. Using unsupervised techniques (LDA, LSA, doc2vec, clustering) on that corpus, I am also able to get reasonable results.

I feel like there should be some way to use all of the data together, but I don't really know where to start. My intuition is that the unlabeled data carries a lot of information (for example, about word synonymy and polysemy) that I should be able to use to give my classifier a good head start on the language model.
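
One common pattern along those lines is to pre-train document embeddings on the labeled and unlabeled corpora together, then fit the supervised classifier on the small labeled subset in that embedding space. A hedged sketch, assuming gensim's Doc2Vec and scikit-learn (the texts and labels are placeholders):

```python
# Sketch: learn doc2vec embeddings from ALL documents, then classify the labeled ones.
# Assumes gensim and scikit-learn; the texts and labels below are placeholders.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

unlabeled_texts = ["lots of unlabeled text", "more unlabeled text"]
labeled_texts = ["short document one", "another short document"]
labels = [0, 1]

# Pre-train embeddings on the combined corpus (this is where the unlabeled data helps).
corpus = [TaggedDocument(words=text.lower().split(), tags=[i])
          for i, text in enumerate(unlabeled_texts + labeled_texts)]
d2v = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=20)

# Represent only the labeled documents in the learned space and fit the classifier.
X = [d2v.infer_vector(text.lower().split()) for text in labeled_texts]
clf = LogisticRegression(max_iter=1000).fit(X, labels)

print(clf.predict([d2v.infer_vector("a new short document".split())]))
```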

Can anyone suggest any algorithms/techniques/white papers/libraries that fit the bill, or tell me why I'm thinking about this the wrong way?

Best Answer

OpenAI recently published a paper and blog post that may answer your question. Basically, they trained a character-level RNN on 82 million unlabeled Amazon reviews and, using the LSTM's hidden units as a vector representation of the text, added a linear (logistic regression) classifier trained on only 11 labeled data points, achieving pretty good results. Note: I haven't tested this method myself.
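
The general recipe there is representation learning on the unlabeled text followed by a cheap linear probe on the few labels you have. A rough sketch of that pattern, where `encode` is a hypothetical stand-in for the frozen LSTM features (not OpenAI's actual API), would look something like this:

```python
# Sketch of "unsupervised features + linear classifier on a handful of labels".
# `encode` is a hypothetical placeholder for whatever frozen representation you have
# (e.g. the hidden state of a language model trained on the unlabeled corpus).
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode(text: str) -> np.ndarray:
    # Placeholder only: returns a deterministic random vector per text.
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.normal(size=128)

labeled_texts = ["terrible, would not buy again", "works perfectly, love it"]
labels = [0, 1]  # a tiny labeled set, in the spirit of the setup described above

X = np.stack([encode(text) for text in labeled_texts])
clf = LogisticRegression(max_iter=1000).fit(X, labels)

print(clf.predict(np.stack([encode("pretty good overall")])))
```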
