Solved – Text Mining: how to cluster texts (e.g. news articles) with artificial intelligence

clustering, feature selection, neural networks, self organizing maps, text mining

I have built several neural networks (MLPs (fully connected), Elman (recurrent)) for different tasks, such as playing Pong, classifying handwritten digits, and so on.

Additionally, I have tried to build my first convolutional neural networks, e.g. for classifying multi-digit handwritten notes, but I am completely new to analyzing and clustering texts. In image recognition/clustering tasks one can rely on standardized input, like 25×25-pixel images in RGB or greyscale, so there are plenty of features one can assume in advance.

For text mining, for instance of news articles, the input size is ever changing (different words, different sentences, different text lengths, …).

How can one implement a modern text mining tool utilizing artificial intelligence, preferably neural networks / SOMs?

Unfortunately, I was unable to find simple tutorials to start with. Complex scientific papers are hard to read and, in my opinion, not the best way to learn a topic. I have already read quite a few papers about MLPs, dropout techniques, convolutional neural networks and so on, but I was unable to find a basic one about text mining – everything I found was far too high-level for my very limited text mining skills.

Best Answer

Latent Dirichlet Allocation (LDA) is great, but if you want something better that uses neural networks I would strongly suggest doc2vec (https://radimrehurek.com/gensim/models/doc2vec.html).

What does it do? It works similarly to Google's word2vec, but instead of a feature vector per word you get a feature vector per paragraph/document. The method is based on a skip-gram model and neural networks, and is considered one of the best methods for extracting a feature vector for documents.

Now, given that you have this vector per document, you can run k-means clustering (or any other algorithm you prefer) on the vectors and cluster the results.
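As a minimal sketch of that clustering step (assuming scikit-learn and NumPy are installed; the random vectors here are just stand-ins for the doc2vec output):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the doc2vec output: one 50-dimensional vector per document.
# Two well-separated blobs simulate two topics.
rng = np.random.default_rng(0)
doc_vectors = np.vstack([
    rng.normal(0.0, 0.1, (10, 50)),   # 10 documents around one topic
    rng.normal(5.0, 0.1, (10, 50)),   # 10 documents around another topic
])

# Cluster the document vectors into 2 groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(doc_vectors)
print(kmeans.labels_)  # one cluster id per document
```

With real doc2vec vectors you would usually also try a few values of `n_clusters` (e.g. via silhouette scores), since the number of topics is rarely known in advance.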

Finally, extracting the feature vectors is as easy as this:

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument  # LabeledSentence is deprecated

class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        # one document per line; tag each document with a unique id
        with open(self.filename) as f:
            for uid, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=['TXT_%s' % uid])


sentences = LabeledLineSentence('your_text.txt')

model = Doc2Vec(alpha=0.025, min_alpha=0.025, vector_size=50, window=5,
                min_count=5, dm=1, workers=8, sample=1e-5)

model.build_vocab(sentences)

for epoch in range(500):
    try:
        print('epoch %d' % epoch)
        model.train(sentences, total_examples=model.corpus_count, epochs=1)
        # decay the learning rate manually between passes
        model.alpha *= 0.99
        model.min_alpha = model.alpha
    except (KeyboardInterrupt, SystemExit):
        break