Solved – How to plot results from text mining (e.g. classification or clustering)

Tags: classification, clustering, data visualization, python, text mining

In text classification and clustering, the number of features is usually large. I currently have around 5,000 features, which is already small compared to many other text mining tasks. Since I am totally new to visualization, I have no idea how to plot the results of text classification and clustering.

For example, in one clustering task I have three categories, hundreds of documents, and several thousand features. After I apply a clustering algorithm, e.g. KMeans from scikit-learn, I have no idea how to plot the results in a 2D figure. The same goes for classification: my texts are sorted into three categories, but I don't know how to plot them.

I've tried to learn from some examples, but most of the ones I found use purely numeric data with very few features rather than text.

Question: Could you refer me to any tutorial or paper on either drawing text clustering/classification results or explaining the math behind such visualizations? Any other help would also be appreciated.

Best Answer

If you're doing classification, this should be fairly straightforward. Just select some aggregate measure of performance (e.g. accuracy), and plot the distribution of that measure over different random initializations of the algorithm (for k-means, different random starting centroids, with the result scored against the known categories). This gives you some information about how well the algorithm performs on average.
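As a rough sketch of that idea, the snippet below runs k-means with 30 different random seeds on a made-up toy corpus and plots a histogram of the scores. Everything about the data (the `TOPICS` vocabularies, the corpus size, the output filename) is a hypothetical stand-in for your own documents. Note one swap: it uses the adjusted Rand index instead of plain accuracy, because k-means cluster IDs are arbitrary and would need to be matched to category labels before accuracy makes sense.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy corpus: three topics with small, disjoint vocabularies.
TOPICS = [
    "goal match team player league score coach".split(),
    "stock market price trade bank profit investor".split(),
    "protein cell gene enzyme tissue genome molecule".split(),
]
rng = np.random.default_rng(0)
texts = [" ".join(rng.choice(vocab, size=8)) for vocab in TOPICS for _ in range(20)]
labels = np.repeat(np.arange(3), 20)  # known category of each document

# Bag-of-words features weighted by TF-IDF (sparse matrix, one row per document).
X = TfidfVectorizer().fit_transform(texts)

def scores_over_seeds(X, labels, n_runs=30):
    """Score one k-means run per random seed against the known labels."""
    scores = []
    for seed in range(n_runs):
        # n_init=1 and init="random" so each run really is a single random start.
        km = KMeans(n_clusters=3, n_init=1, init="random", random_state=seed)
        scores.append(adjusted_rand_score(labels, km.fit_predict(X)))
    return np.array(scores)

scores = scores_over_seeds(X, labels)

# Histogram of the score across random initializations.
fig, ax = plt.subplots()
ax.hist(scores, bins=10)
ax.set_xlabel("Adjusted Rand index")
ax.set_ylabel("Number of runs")
ax.set_title("k-means quality over random initializations")
fig.savefig("kmeans_score_distribution.png")
```

A narrow histogram near 1 suggests the result is stable; a wide or bimodal one suggests the clustering depends heavily on where the centroids start.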

If you're doing clustering (i.e. unsupervised learning), then you can make a pretty picture of the clusters using a dimensionality reduction technique. A simple technique might be to pick a reference point in the feature space, and plot each document by its Euclidean distance from that point and its category. You could also use more advanced techniques like PCA, which projects the high-dimensional feature vectors onto the two directions of greatest variance.
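Here is a minimal sketch of both suggestions, again on a made-up toy corpus (the `TOPICS` vocabularies and the filename are hypothetical). The left panel projects the TF-IDF vectors onto two principal components; the right panel plots each document's Euclidean distance from a reference point against its cluster. Using the corpus centroid as the reference point is my own assumption; any fixed point would do. For a much larger corpus you would probably not want to densify the matrix, and scikit-learn's `TruncatedSVD` is a common alternative that works directly on sparse input.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: three topics with small, disjoint vocabularies.
TOPICS = [
    "goal match team player league score coach".split(),
    "stock market price trade bank profit investor".split(),
    "protein cell gene enzyme tissue genome molecule".split(),
]
rng = np.random.default_rng(0)
texts = [" ".join(rng.choice(vocab, size=8)) for vocab in TOPICS for _ in range(20)]

# Vectorize the documents and cluster them in the full feature space.
X = TfidfVectorizer().fit_transform(texts)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# PCA needs a dense array here; fine for hundreds of documents x thousands of features.
dense = X.toarray()
coords = PCA(n_components=2, random_state=0).fit_transform(dense)

# Simple alternative: Euclidean distance of each document from one reference point
# (assumed here to be the centroid of the whole corpus).
reference = dense.mean(axis=0)
distances = np.linalg.norm(dense - reference, axis=1)

fig, (ax_pca, ax_dist) = plt.subplots(1, 2, figsize=(10, 4))

# Left: 2D PCA projection, one dot per document, colored by cluster.
ax_pca.scatter(coords[:, 0], coords[:, 1], c=clusters, cmap="viridis")
ax_pca.set_xlabel("Principal component 1")
ax_pca.set_ylabel("Principal component 2")
ax_pca.set_title("PCA projection")

# Right: distance from the reference point, grouped by cluster.
ax_dist.scatter(clusters, distances, c=clusters, cmap="viridis")
ax_dist.set_xticks([0, 1, 2])
ax_dist.set_xlabel("Cluster")
ax_dist.set_ylabel("Euclidean distance from reference point")
ax_dist.set_title("Distance from corpus centroid")

fig.tight_layout()
fig.savefig("text_clusters_2d.png")
```

The same plot works for classification results: pass the predicted (or true) category of each document as `c=` instead of the cluster assignments.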

If you're interested in finding out which features are good predictors, I suggest running something like a maximum entropy classifier (i.e. multinomial logistic regression) on the data, or one of the many feature selection algorithms. These techniques will provide you with a weight for each feature indicating its importance in predicting the groupings.