Solved – guide for text classification using Weka

data mining, machine learning, text mining, weka

I have a set of 2000 short texts (each under 500 words) that I categorized manually. All the texts are on the same main subject, and I want to separate them into distinct groups based on their similarity and their focus within the topic. What would be the best approach to separate these texts automatically? I do not have a training set, and I would like either to confirm the existing labeling or to find an alternative clustering of my dataset.

Best Answer

Using methods available in Weka, you could start by applying the StringToWordVector unsupervised attribute filter and then run any suitable clustering method. Remember to ignore the existing class attribute when clustering if it is present in your dataset.
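To make concrete what a word-vector filter such as StringToWordVector produces, here is a minimal Python sketch (not Weka code) that turns a few texts into vectors over a shared vocabulary. The tokenizer, the function name to_word_vectors and the sample texts are assumptions made for illustration; Weka's filter has its own tokenizer and many options that this does not reproduce.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def to_word_vectors(texts, counts=False):
    """Return (vocabulary, vectors), one vector per text over a sorted vocabulary.

    With counts=False each entry is 1 if the word occurs in the text, else 0.
    With counts=True each entry is the number of occurrences of the word.
    """
    tokenized = [tokenize(t) for t in texts]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for tokens in tokenized:
        vector = [0] * len(vocabulary)
        for word in tokens:
            if counts:
                vector[index[word]] += 1
            else:
                vector[index[word]] = 1
        vectors.append(vector)
    return vocabulary, vectors


# Hypothetical sample texts, standing in for the 2000 documents.
texts = ["Weka clusters text", "Text mining with Weka, text filters"]
vocabulary, vectors = to_word_vectors(texts, counts=True)
print(vocabulary)
print(vectors)
```

Each column of the result corresponds to one word, which is why the clustering step that follows treats words as attributes.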

In the Weka Explorer you can save the cluster assignments by right-clicking the result in the result list, choosing Visualize cluster assignments, and then clicking Save.

To compare the clustering result with the existing categorisation, assuming the dataset contains the existing class attribute, select Classes to clusters evaluation as the cluster mode and choose the class attribute from the dropdown.
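The idea behind comparing clusters with existing classes can be sketched outside Weka as a cross-tabulation. The Python snippet below is a simplified illustration, not Weka's implementation: it assigns each cluster its most frequent class and reports the share of documents that disagree with that assignment. The function name classes_to_clusters and the sample labels are hypothetical, and Weka's own class-to-cluster mapping may differ.

```python
from collections import Counter, defaultdict


def classes_to_clusters(class_labels, cluster_ids):
    """Cross-tabulate clusters against classes and score a majority-class mapping.

    Returns (table, mapping, error_rate) where table[cluster][label] is a count,
    mapping[cluster] is the most frequent label in that cluster, and error_rate
    is the fraction of documents whose label differs from their cluster's label.
    """
    if not class_labels or len(class_labels) != len(cluster_ids):
        raise ValueError("need one class label and one cluster id per document")
    table = defaultdict(Counter)
    for label, cluster in zip(class_labels, cluster_ids):
        table[cluster][label] += 1
    mapping = {c: counts.most_common(1)[0][0] for c, counts in table.items()}
    errors = sum(
        1 for label, c in zip(class_labels, cluster_ids) if mapping[c] != label
    )
    return table, mapping, errors / len(class_labels)


# Hypothetical manual labels and cluster assignments for six documents.
labels = ["sport", "sport", "sport", "politics", "politics", "politics"]
clusters = [0, 0, 1, 1, 1, 1]
table, mapping, error_rate = classes_to_clusters(labels, clusters)
print(mapping)
print(round(error_rate, 3))
```

A low error rate suggests the clustering largely agrees with the manual labels; a high one suggests the clusters capture some other structure in the texts.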

Of course, clustering is not guaranteed to separate the documents by subject. For example, if several authors with very different writing styles each contributed documents on every subject, you might find that the clusters correspond to the authors instead. You might therefore want to manually select a subset of attributes (i.e. words) from the filtered dataset that you judge to be relevant to the topic(s) before clustering.
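Restricting the data to hand-picked words amounts to keeping only some columns of the word-vector table. The following Python sketch shows that step under assumed inputs; the function name keep_selected_words, the vocabulary, the vectors and the chosen topic words are all hypothetical. In Weka itself the same effect would come from removing the unwanted attributes before clustering.

```python
def keep_selected_words(vocabulary, vectors, selected_words):
    """Keep only the columns whose word appears in selected_words."""
    selected = set(selected_words)
    keep = [i for i, word in enumerate(vocabulary) if word in selected]
    reduced_vocabulary = [vocabulary[i] for i in keep]
    reduced_vectors = [[vector[i] for i in keep] for vector in vectors]
    return reduced_vocabulary, reduced_vectors


# Hypothetical word-count table: one row per document, one column per word.
vocabulary = ["clusters", "filters", "mining", "text", "weka", "with"]
vectors = [[1, 0, 0, 1, 1, 0], [0, 1, 1, 2, 1, 1]]

# Words judged relevant to the topic (an assumption for this example).
topic_words = ["clusters", "mining", "filters"]
reduced_vocabulary, reduced_vectors = keep_selected_words(
    vocabulary, vectors, topic_words
)
print(reduced_vocabulary)
print(reduced_vectors)
```

Dropping style-related or very common words this way reduces the chance that clusters form around authorship rather than subject.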