Multi-Dimensional Data Visualization – How to Visualize Using LSI in 2D

clustering, data visualization, multidimensional scaling, python

I'm using latent semantic indexing to find similarities between documents (thanks, JMS!)

After dimension reduction, I've tried k-means clustering to group the documents into clusters, which works very well. But I'd like to go a bit further, and visualize the documents as a set of nodes, where the distance between any two nodes is inversely proportional to their similarity (nodes that are highly similar are close together).

It strikes me that I can't accurately reduce a similarity matrix to a 2-dimensional graph, since my data has more than two dimensions. So my first question: is there a standard way to do this?

Could I just reduce my data to two dimensions and plot them as the X and Y axes, and would that suffice for a group of ~100-200 documents? If this is the solution, is it better to reduce my data to two dimensions from the start, or is there a way to pick the two "best" dimensions from my multi-dimensional data?

I am using Python and the gensim library if that makes a difference.

Best Answer

This is what MDS (multidimensional scaling) is designed for. In short, given a similarity matrix $M$, you want to find the closest approximation $S = X X^\top$ where $S$ has rank 2. Since $M$ is symmetric, its SVD coincides with its eigendecomposition (assuming $M$ is positive semidefinite), so this can be done by computing $M = V \Lambda V^\top = X X^\top$ where $X = V \Lambda^{1/2}$.

Now, assuming that $\Lambda$ is permuted so the eigenvalues are in decreasing order, the first two columns of $X$ are your desired embedding in the plane: row $i$ of those two columns gives the coordinates of document $i$.
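The recipe above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `embed_2d` and the toy data are made up for the example, $M$ is assumed to be a symmetric similarity matrix (for instance cosine similarities between LSI vectors), and any slightly negative eigenvalues are clipped to zero before taking the square root.

```python
import numpy as np

def embed_2d(M, k=2):
    """Return an n-by-k matrix X such that X @ X.T approximates M.

    Assumes M is a symmetric similarity matrix. Negative eigenvalues,
    if any, are clipped to zero (an assumption of this sketch).
    """
    M = np.asarray(M, dtype=float)
    M = (M + M.T) / 2.0                      # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(M)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)   # X = V * Lambda^(1/2)

# Hypothetical toy data: 5 "documents" in a 4-dimensional LSI space.
rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 4))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
M = docs @ docs.T                            # cosine similarity matrix
coords = embed_2d(M)                         # one (x, y) row per document
```

Each row of `coords` can then be handed to any scatter-plot routine. When $M$ really has rank 2, `coords @ coords.T` reproduces it exactly; otherwise it is the best rank-2 approximation of this form.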

There's lots of code available for MDS; scikit-learn, for example, provides `sklearn.manifold.MDS`. In any case, as long as you have access to some SVD routine in Python, you're set.
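As a rough illustration of the "any SVD routine" route, here is a sketch using `numpy.linalg.svd`. The helper name `svd_embedding` and the random stand-in data are assumptions made for the example, and $M$ is assumed to be symmetric positive semidefinite and nonzero. It also reports how much of the spectrum the first two singular values capture, which is one informal way to judge whether a 2D plot is a fair picture of the data.

```python
import numpy as np

def svd_embedding(M, k=2):
    """Embed a symmetric PSD similarity matrix in k dimensions via SVD.

    Returns the n-by-k coordinates and the fraction of the spectrum
    captured by the k largest singular values.
    """
    M = np.asarray(M, dtype=float)
    U, s, _ = np.linalg.svd(M)          # singular values in decreasing order
    X = U[:, :k] * np.sqrt(s[:k])       # X = V * Lambda^(1/2), first k columns
    captured = s[:k].sum() / s.sum()    # share of the spectrum kept in k dims
    return X, captured

# Hypothetical stand-in for LSI output: 100 documents, 10 topics.
rng = np.random.default_rng(42)
doc_vectors = rng.normal(size=(100, 10))
M = doc_vectors @ doc_vectors.T         # Gram (similarity) matrix
X, captured = svd_embedding(M)
print(X.shape, round(float(captured), 3))
```

A `captured` value close to 1 suggests the plane preserves most of the similarity structure; a small value means the 2D picture discards a lot and should be read with caution.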