Solved – Text Mining: how to cluster texts (e.g. news articles) with artificial intelligence

clustering, feature selection, neural networks, self organizing maps, text mining

I have built several neural networks (MLPs (fully connected), Elman (recurrent)) for different tasks, such as playing Pong, classifying handwritten digits, and so on.

Additionally, I have tried to build my first convolutional neural networks, e.g. for classifying multi-digit handwritten notes, but I am completely new to analyzing and clustering texts. In image recognition/clustering tasks one can rely on standardized input, like 25×25-pixel images in RGB or greyscale, so there are plenty of features one can assume in advance.

For text mining, for instance of news articles, the input size is ever changing (different words, different sentences, different text lengths, …).

How can one implement a modern text mining tool utilizing artificial intelligence, preferably neural networks / SOMs?

Unfortunately, I was unable to find simple tutorials to start with. Complex scientific papers are hard to read and, in my opinion, not the best way to learn a topic. I have already read quite a few papers about MLPs, dropout techniques, convolutional neural networks and so on, but I was unable to find a basic one about text mining – everything I found was far too high-level for my very limited text mining skills.

Best Answer

Latent Dirichlet Allocation (LDA) is great, but if you want something better that uses neural networks I would strongly suggest doc2vec (https://radimrehurek.com/gensim/models/doc2vec.html).

What does it do? It works similarly to Google's word2vec, but instead of a feature vector per word you get a feature vector per paragraph/document. The method is based on a skip-gram model and neural networks, and is considered one of the best methods for extracting a feature vector for documents.

Now, given that you have this vector per document, you can run k-means clustering (or any other algorithm you prefer) on the vectors and cluster the results.
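As a minimal sketch of that clustering step (assuming scikit-learn and NumPy are installed; the random vectors here are just stand-ins for the doc2vec output):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the doc2vec output: one 50-dimensional vector per document.
# Two well-separated blobs simulate two topics.
rng = np.random.default_rng(0)
doc_vectors = np.vstack([
    rng.normal(0.0, 0.1, (10, 50)),   # 10 documents around one topic
    rng.normal(5.0, 0.1, (10, 50)),   # 10 documents around another topic
])

# Cluster the document vectors into 2 groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(doc_vectors)
print(kmeans.labels_)  # one cluster id per document
```

With real doc2vec vectors you would usually also try a few values of `n_clusters` (e.g. via silhouette scores), since the number of topics is rarely known in advance.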

Finally, extracting the feature vectors is as easy as this:

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument  # LabeledSentence is deprecated

class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        # one document per line; tag each document with a unique id
        with open(self.filename) as f:
            for uid, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=['TXT_%s' % uid])


sentences = LabeledLineSentence('your_text.txt')

model = Doc2Vec(alpha=0.025, min_alpha=0.025, vector_size=50, window=5,
                min_count=5, dm=1, workers=8, sample=1e-5)

model.build_vocab(sentences)

for epoch in range(500):
    try:
        print('epoch %d' % epoch)
        model.train(sentences, total_examples=model.corpus_count, epochs=1)
        # decay the learning rate manually between passes
        model.alpha *= 0.99
        model.min_alpha = model.alpha
    except (KeyboardInterrupt, SystemExit):
        break