I have a set of 2000 short texts (each under 500 words) that I categorized manually. All the texts are on the same main subject, and I want to separate them into distinct groups based on their similarity and their focus within the topic. What would be the best approach to separate these texts automatically? I do not have a training set, and I would like either to confirm the existing labeling or to find an alternative clustering of my dataset.
Solved – guide for text classification using weka
data mining, machine learning, text mining, weka
Best Answer
Using methods available in Weka, you could start by applying the StringToWordVector unsupervised attribute filter, then run any suitable clustering method (SimpleKMeans, for example), remembering to ignore the existing class attribute if it is present in your dataset.
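Outside the Weka GUI, the same two steps (turning each text into a word vector, then clustering with the class label left out) can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not Weka's implementation: the names `to_word_vectors`, `farthest_first` and `kmeans` are hypothetical, the tokenizer is a simple regex, and the centroids are seeded with a deterministic farthest-first pass so that runs are repeatable.

```python
import math
import re
from collections import Counter


def to_word_vectors(texts, min_doc_freq=1):
    """Turn raw texts into length-normalised word-count vectors."""
    tokenized = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    # Document frequency: in how many texts each word occurs.
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    vocab = sorted(w for w, df in doc_freq.items() if df >= min_doc_freq)
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for w in toks:
            if w in index:
                vec[index[w]] += 1.0
        # Normalise so that long and short texts are comparable.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(vectors, k):
    """Seed centroids: start at the first vector, then repeatedly add the
    vector farthest from all centroids chosen so far (an assumption made
    here for reproducibility, not a Weka default)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        nxt = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(vectors, k, max_iterations=50):
    """Plain k-means; returns one cluster id per input vector."""
    centroids = farthest_first(vectors, k)
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no text changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


if __name__ == "__main__":
    sample = ["cat dog cat pet", "dog pet cat",
              "stock market bond", "bond stock price market"]
    vocab, vectors = to_word_vectors(sample)
    print(kmeans(vectors, k=2))
```

Note that the manual category never enters the vectors, which mirrors ignoring the class attribute in Weka; it is only needed afterwards, when the clusters are compared with the labels.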
In the Weka Explorer you can save the cluster assignments by right-clicking the result in the result list, choosing "Visualize cluster assignments", then clicking "Save".
To compare the clustering result with the existing categorisation, assuming the dataset does contain the existing class, select "Classes to clusters evaluation" and choose the class attribute from the dropdown.
Of course, clustering is not guaranteed to separate the documents by subject. For example, if there are different authors with very different writing styles who each contributed documents on each subject, you might find that the clusters correspond to the authors instead. You might want to manually select a subset of attributes (i.e. words) from the filtered dataset that you judge to be relevant to the topic(s) before clustering.
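To make the comparison between clusters and existing categories concrete, here is a small sketch of a classes-to-clusters style check in Python. It is a simplified stand-in, not Weka's own procedure: the function name `classes_to_clusters` is hypothetical, each cluster is simply mapped to its most frequent manual category, and several clusters may map to the same category, so Weka's reported figures may differ.

```python
from collections import Counter, defaultdict


def classes_to_clusters(true_labels, cluster_ids):
    """Cross-tabulate clusters against manual labels.

    Returns (table, mapping, error_rate), where table[cluster][label] is a
    count, mapping[cluster] is the majority label of that cluster, and
    error_rate is the share of texts whose label differs from the majority
    label of their cluster.
    """
    table = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        table[cluster][label] += 1
    # Simplification: majority vote per cluster, no one-to-one constraint.
    mapping = {c: counts.most_common(1)[0][0] for c, counts in table.items()}
    wrong = sum(
        1 for label, cluster in zip(true_labels, cluster_ids)
        if mapping[cluster] != label
    )
    return dict(table), mapping, wrong / len(true_labels)


if __name__ == "__main__":
    manual = ["a", "a", "a", "b", "b", "b"]
    clusters = [0, 0, 1, 1, 1, 1]
    table, mapping, error = classes_to_clusters(manual, clusters)
    print(mapping, round(error, 3))
```

A low error rate suggests the clusters largely agree with the manual categories. A high one can mean either that the labeling is questionable or that the clusters follow something other than subject, such as author style.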