Solved – Text Embeddings on a Small Dataset

classification, natural language, word embeddings

I am trying to solve a binary text classification problem on academic text in a niche domain (Generative vs. Cognitive Linguistics). My target data consists of nearly 400 paper abstracts, each under 300 words. I first tried Doc2Vec, but the best accuracy I could get was around 82%. I have since looked into pre-trained vectors, but the consensus on Doc2Vec is that pre-trained models are best avoided. I also tried pre-trained Word2Vec models, but they are usually huge and my laptop (8 GB of RAM) cannot load them. So I decided to collect a larger source dataset myself, train a word-embedding model on it, and then use those word vectors in the target domain.
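For reference, my Doc2Vec baseline looks roughly like this (a minimal sketch; `abstracts` and `labels` are placeholders for my ~400 abstracts and their binary labels):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholders: `abstracts` is a list of ~400 strings, `labels` a 0/1 array
tagged = [TaggedDocument(words=text.lower().split(), tags=[i])
          for i, text in enumerate(abstracts)]

# Train Doc2Vec on the target abstracts themselves
model = Doc2Vec(tagged, vector_size=100, window=5, min_count=2,
                epochs=40, workers=4)

# One learned vector per abstract, fed to a simple linear classifier
X = np.vstack([model.dv[i] for i in range(len(abstracts))])
scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
print("Doc2Vec + LogisticRegression accuracy:", scores.mean())
```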

I have collected more than 70K paper abstracts from related fields (mostly papers tagged with Linguistics) and trained FastText, Doc2Vec and Word2Vec models on this source data. But after using these models in the target domain, the results are not better than my previous attempt with a simple Doc2Vec, not even marginally.
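Roughly, the source-domain training looks like this (a sketch with gensim; `source_abstracts` is a placeholder for the ~70K collected abstracts):

```python
from gensim.models import FastText, Word2Vec

# Placeholder: `source_abstracts` is the list of ~70K source-domain abstracts
source_tokens = [text.lower().split() for text in source_abstracts]

# Word-level embeddings trained on the larger source corpus
w2v = Word2Vec(sentences=source_tokens, vector_size=300, window=5,
               min_count=5, workers=4, epochs=10)
w2v.save("linguistics_w2v.model")

# FastText adds subword information, which should help with rare technical terms
ft = FastText(sentences=source_tokens, vector_size=300, window=5,
              min_count=5, workers=4, epochs=10)
ft.save("linguistics_fasttext.model")
```

I then build document vectors for the ~400 target abstracts from these embeddings (initially by plain averaging of the word vectors) and train a classifier on top.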

I have also tried TF-IDF and CountVectorizer features on the target domain, but the results do not improve.

Yesterday I stumbled upon this implementation of building document vectors from Word2Vec and TF-IDF together, but to my surprise the results are only on par with plain averaging of the word vectors.
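As I understand it, the idea is simply to weight each word vector by the word's IDF before averaging, so that each token contributes tf × idf in total. A minimal sketch of that weighting (assuming the `w2v` model trained above and pre-tokenized target abstracts; the names are placeholders):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_weighted_doc_vectors(tokenized_docs, w2v_model):
    """Average each document's word vectors, weighting every token by its IDF
    (summing over repeated tokens gives a TF-IDF weighted average)."""
    tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)  # docs are pre-tokenized
    tfidf.fit(tokenized_docs)
    idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

    dim = w2v_model.vector_size
    doc_vecs = []
    for tokens in tokenized_docs:
        vecs, weights = [], []
        for tok in tokens:
            if tok in w2v_model.wv and tok in idf:
                vecs.append(w2v_model.wv[tok])
                weights.append(idf[tok])
        doc_vecs.append(np.average(vecs, axis=0, weights=weights)
                        if vecs else np.zeros(dim))
    return np.vstack(doc_vecs)

# Placeholder: `target_tokens` is the tokenized list of the ~400 target abstracts
X_weighted = tfidf_weighted_doc_vectors(target_tokens, w2v)
```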

I was thinking of maybe bringing active learning into the process, since the bottleneck may well be the very small target dataset? Or maybe generating synthetic texts similar to the target data?
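By active learning I mean something like uncertainty sampling over the unlabelled source abstracts, i.e. hand-labelling only the documents the current classifier is least sure about. A rough sketch with placeholder variables:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholders: X_labeled / y_labeled are document vectors and labels for the
# ~400 annotated abstracts; X_pool holds vectors for unlabelled source abstracts
def most_uncertain(clf, X_pool, n=20):
    """Return indices of the n pool documents closest to the decision boundary."""
    proba = clf.predict_proba(X_pool)[:, 1]
    return np.argsort(np.abs(proba - 0.5))[:n]

clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
to_annotate = most_uncertain(clf, X_pool, n=20)  # label these next, retrain, repeat
```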

Thank you for reading this.

Best Answer

For this type of problem you may want to consider ULMFiT or a similar fine-tuning approach. The advantage is that you fine-tune a language model trained on another corpus to your smaller corpus (the ~70K abstracts) and then build a classifier on top of the whole language model using the small amount of labelled data. That is, instead of just using a first-layer embedding like the ones you mentioned, you get to use the structure of the whole neural network, which you can fine-tune with careful, gradual unfreezing of layers. This is implemented e.g. in the fastai Python library and described in the lectures of the excellent fast.ai deep-learning course. Additionally, you may consider some data augmentation during training (e.g. round-trip translation or word/phrase substitution).
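For illustration, a rough sketch of that workflow with the fastai v2 text API; the file names, column names and hyper-parameters below are placeholders, not a drop-in recipe:

```python
import pandas as pd
from fastai.text.all import *

# Placeholders: df_source has a 'text' column with the ~70K unlabelled abstracts,
# df_target has 'text' and 'label' columns for the ~400 labelled abstracts
df_source = pd.read_csv("source_abstracts.csv")
df_target = pd.read_csv("target_abstracts.csv")

# 1. Fine-tune the pretrained AWD-LSTM language model on the source corpus
dls_lm = TextDataLoaders.from_df(df_source, text_col="text", is_lm=True, valid_pct=0.1)
lm = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, metrics=accuracy)
lm.fit_one_cycle(1, 1e-2)
lm.unfreeze()
lm.fit_one_cycle(5, 1e-3)
lm.save_encoder("ft_encoder")

# 2. Train the classifier on the labelled abstracts, reusing the fine-tuned encoder
dls_clas = TextDataLoaders.from_df(df_target, text_col="text", label_col="label",
                                   valid_pct=0.2, text_vocab=dls_lm.vocab)
clas = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
clas.load_encoder("ft_encoder")

# Gradual unfreezing with discriminative learning rates, as in the ULMFiT paper
clas.fit_one_cycle(1, 2e-2)
clas.freeze_to(-2)
clas.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
clas.freeze_to(-3)
clas.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3))
clas.unfreeze()
clas.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))
```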