Solved – Plot Latent Dirichlet Allocation output using t-SNE

Tags: dimensionality-reduction, latent-dirichlet-alloc, tsne

I found this blog where the author trains a Latent Dirichlet Allocation (LDA) model on 20 Newsgroups. The output is an $N\times K$ matrix, where $N$ is the number of articles (rows) and $K$ is the number of topics (columns); i.e., each row is a discrete distribution over topics.

The author then uses t-SNE to reduce the dimensionality of the matrix from $K$ dimensions to 2 in order to visualize the document groupings by topic. The document groupings in the t-SNE output even seem to make sense.
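For concreteness, the pipeline can be sketched with scikit-learn (these are assumed library choices; the blog's exact code may differ, and a tiny toy corpus stands in for 20 Newsgroups so the example runs offline):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE

# Toy corpus: two rough themes (sports, space), repeated so t-SNE
# has enough points to work with.
docs = [
    "the game ended with a late goal",
    "the team won the hockey game",
    "the rocket launch was delayed",
    "nasa scheduled another rocket launch",
] * 10

counts = CountVectorizer(stop_words="english").fit_transform(docs)

K = 3  # number of topics
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topics = lda.fit_transform(counts)   # N x K; each row is a topic distribution

# Reduce the K-dimensional topic space to 2-D for plotting.
xy = TSNE(n_components=2, metric="euclidean", perplexity=5,
          random_state=0).fit_transform(doc_topics)
print(doc_topics.shape, xy.shape)        # (40, 3) (40, 2)
```

The `xy` array can then be scatter-plotted, colouring each point by its argmax topic, which is essentially the visualization the blog produces.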

My question is: is it reasonable to do this? LDA outputs a discrete distribution over topics for every document, while t-SNE reduces the dimensionality of points in a high-dimensional space to visualize local structure. Because the output of LDA is a distribution, I thought it might be incorrect to do this. I understand that a discrete distribution can be thought of as a point in the $K$-dimensional space, but using t-SNE to visualize a discrete output still seems off. Am I missing something here?

EDIT: The metric the author uses in t-SNE is Euclidean distance – that is why I am confused: the author is using Euclidean distance to compare probability distributions.

Best Answer

I think the approach described in the blog post is reasonable. The goal of t-SNE is to find a representation of the input in a low-dimensional space such that points that are similar in the original space remain similar in the representation space. In the blog post, the inputs are the topic probabilities of each document, so documents with a low Euclidean distance between their topic probabilities should end up with similar t-SNE representations.

So what does Euclidean distance between topic probabilities measure? Say we have the topic probabilities of two documents, $$ p = (p_1, ..., p_K),$$ $$ q = (q_1, ..., q_K).$$ If the distance between $p$ and $q$ is $0$, the documents have exactly the same topic distribution. As the distance increases, the topic distributions become more separated. The extreme case is when $p$ and $q$ are one-hot vectors of the form $(0,...,0,1,0,...,0)$ with the $1$ in different coordinates, so the documents have completely different topics; the distance is then maximal and equal to $\sqrt{2}$. So the distance between t-SNE coordinates should reflect how similar the subjects of two documents are.
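The extreme case is easy to check numerically (a minimal sketch; note that for two one-hot vectors with the $1$ in different coordinates, $\|p-q\| = \sqrt{2}$):

```python
import numpy as np

# Two "extreme" topic distributions: each document sits entirely on a
# single, different topic (one-hot vectors in the K-dimensional simplex).
K = 5
p = np.zeros(K); p[0] = 1.0
q = np.zeros(K); q[1] = 1.0

d = np.linalg.norm(p - q)          # Euclidean distance between the distributions
print(d)                            # sqrt(2) ≈ 1.4142
```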

There are other measures of distance between discrete distributions (e.g., the Jensen–Shannon distance). However, Euclidean distance is simple, and it worked in this particular case.
