Solved – Visualizing Mutual Information Against TF-IDF for Text Corpus Data

correlation, data visualization, mutual information, natural language

I'm working on a data visualization project for the semester and have decided to work with a corpus of discussion forum data focused on debate over political issues (available here). I'm visualizing various frequency counts of unigrams, bigrams, and trigrams, but I'm also curious about using other metrics for visualization, such as TF-IDF and mutual information.

One of my ideas for a visualization was to plot n-grams in a scatterplot showing their TF-IDF scores against their Mutual Information scores (Mutual Information being MI between an n-gram and a topic of debate within the corpus, say, "abortion"). My thought was to have each of these metrics as an axis and have each data point represent an n-gram within the plot.

My questions are (a) is this informative at all, or is it completely statistically unsound or redundant to plot n-grams for these two metrics this way, and (b) if something like this would make for a decent visualization, is Mutual Information the best correlation metric to use for one of the axes? Would something else like Chi-square be more appropriate? Thanks.

Best Answer

Instead of putting two different metrics on different axes, you can pick one metric and use latent semantic analysis to produce a 2-D embedding for visualization: https://en.wikipedia.org/wiki/Latent_semantic_analysis

The entries in the occurrence matrix can be normalized with TF-IDF scores or with mutual information.