In LDA, how to interpret the meaning of topics


I am studying the Latent Dirichlet Allocation (LDA) model, and I have found some explanations around the web (for example, here on Quora.com).
In the linked examples, I can clearly see which topics the author is talking about (food and cute animals).
I understand how the model works when you have an idea of what the topics mean. But what happens when you do not know what they mean?

How could the LDA model tell you what the topics are about?

How could the LDA model tell you how many topics there are?

For example, if you run the LDA algorithm to analyze occurrences of genes and their functions, how could the model tell you whether the topics are about diseases, metabolic pathways, genetic disorders, or any other concept that relates genes and functions?

Best Answer

What LDA does, and what it can answer

Consider this snippet from the paper introducing supervised LDA:

Most topic models, such as latent Dirichlet allocation (LDA), are unsupervised: only the words in the documents are modeled. The goal is to infer topics that maximize the likelihood (or the posterior probability) of the collection.

In other words, for a given corpus and a trained LDA model with fixed $k$, that's all you get: the latent topics that maximize the posterior probability of the observed corpus.
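
To make this concrete, here's a minimal sketch (assuming scikit-learn; the toy documents just echo the food-and-animals theme of the linked Quora example). It fits LDA and prints each topic's top words. Note that the model only returns word distributions; labels like "food" or "animals" are the reader's interpretation, not the model's output:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA

# A toy corpus; in practice, substitute your document collection.
docs = [
    "broccoli is good to eat and bananas are tasty",
    "my kitten is cute and my puppy is adorable",
    "kittens and puppies sometimes eat broccoli",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

lda = LDA(n_components=2, random_state=0).fit(counts)

# Each row of components_ is an (unnormalized) word distribution for a topic;
# printing the highest-weight words is as close as LDA itself gets to "meaning".
words = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:4]
    print("Topic %d: %s" % (idx, ", ".join(words[i] for i in top)))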

Now, that's not to say that a subject-matter expert couldn't make some intuitive guesses in the right direction. Take a look at these topics from an LDA model trained with $k = 16$ on the handwritten digits data that ships with sklearn:

[Figure: LDA topics from the digits data with $k = 16$, each plotted as an 8×8 image.]

Some are entirely recognizable as digits; others we're left to speculate about or analyze further: maybe "half a nine," or "one common way of writing a seven." (See the code below to produce this and a few other plots for varying numbers of topics.)

How many topics, via hierarchical topic models

Above, our choice of $k$ was taken from a quick look through an arbitrary space of possible parameters. That was straightforward, since we expect the number of meaningful topics to be not far from ten, the number of digits.

In your case, there's no mention of prior knowledge that justifies either a chosen $k$ or even a subspace to search. Hierarchical topic models handle this in a principled fashion by employing Dirichlet processes. (Loosely, a DP can be thought of as an infinite-dimensional generalization of the Dirichlet distribution.) Empirically, the hierarchical Dirichlet process has been shown to choose a number of topics similar to the LDA model that minimizes perplexity.

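If you want to try this without implementing a Dirichlet process yourself, gensim ships a hierarchical Dirichlet process model. A minimal sketch, assuming gensim is installed (the tiny tokenized corpus here is only a stand-in for your own data):

from gensim.corpora import Dictionary
from gensim.models import HdpModel

# Tokenized documents; substitute your own corpus.
texts = [
    ["gene", "expression", "pathway"],
    ["protein", "gene", "disease"],
    ["pathway", "metabolism", "enzyme"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# The HDP infers the number of topics from the data,
# rather than requiring a fixed k up front.
hdp = HdpModel(corpus, dictionary)
for topic in hdp.print_topics(num_topics=5, num_words=3):
    print(topic)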

Though hierarchical topic models can handle a single-layered hierarchy, they were motivated by more elaborate models of dependency within and between groups, which may interest you:

We assume that the data are subdivided into a set of groups, and that within each group we wish to find clusters that capture latent structure in the data assigned to that group. The number of clusters within each group is unknown and is to be inferred. Moreover, in a sense that we make precise, we wish to allow clusters to be shared among the groups.

They go on further to detail an example of likely interest:

An example of the kind of problem that motivates us can be found in genetics. Consider a set of $k$ binary markers (e.g., single nucleotide polymorphisms or “SNPs”) in a localized region of the human genome. While an individual human could exhibit any of $2^k$ different patterns of markers on a single chromosome, in real populations only a small subset of such patterns—haplotypes—are actually observed (Gabriel et al. 2002). [...] Now consider an extension of this problem in which the population is divided into a set of groups; e.g., African, Asian and European subpopulations. We may not only want to discover the sets of haplotypes within each subpopulation, but we may also wish to discover which haplotypes are shared between subpopulations. The identification of such haplotypes would have significant implications for the understanding of the migration patterns of ancestral populations of humans.

So, you can use hierarchical models simply to choose the number of topics, or to model much more elaborate group relationships. (I haven't the slightest bioinformatics expertise, so I can't even begin to suggest what would be useful or appropriate, but I hope the details in the paper can help guide you.)

What the topics mean, via sLDA

Finally, if your data include response variables you'd like to predict (e.g., the diseases or genetic disorders you mention), then supervised LDA is probably what you're looking for. From the paper linked above:

In supervised latent Dirichlet allocation (sLDA), we add to LDA a response variable associated with each document. As mentioned, this variable might be the number of stars given to a movie, a count of the users in an on-line community who marked an article interesting, or the category of a document. We jointly model the documents and the responses, in order to find latent topics that will best predict the response variables for future unlabeled documents.
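
Supervised LDA isn't available in scikit-learn, but you can get a feel for the idea with a crude two-stage stand-in: fit unsupervised LDA, then regress the response on the per-document topic proportions. To be clear, this is not sLDA proper, which learns the topics and the regression jointly so that the topics themselves are chosen to predict well; it's only a sketch, reusing the digits data from above with the digit label playing the role of the response:

from sklearn.datasets import load_digits
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.linear_model import LogisticRegression

digits = load_digits()
X = digits['images'].reshape((len(digits['images']), -1))
y = digits['target']  # one response per document, as in sLDA

# Stage 1: unsupervised LDA yields per-document topic proportions.
lda = LDA(n_components=16, random_state=0).fit(X)
theta = lda.transform(X)

# Stage 2: predict the response from the topic proportions.
# True sLDA would fit both stages jointly.
clf = LogisticRegression(max_iter=1000).fit(theta, y)
print("Training accuracy:", clf.score(theta, y))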

A brief aside: Cited in the sLDA paper is this one, which may be of interest:

P. Flaherty, G. Giaever, J. Kumm, M. Jordan, and A. Arkin. A latent variable model for chemogenomic profiling. Bioinformatics, 21(15):3286–3293, 2005.

Code

# -*- coding: utf-8 -*-
"""
Created on Mon Apr 18 18:24:41 2016

@author: SeanEaster
"""

from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.datasets import load_digits

import matplotlib.pyplot as plt
import numpy as np

def untick(sub):
    # Hide tick marks and labels; newer matplotlib expects booleans, not 'off'.
    sub.tick_params(which='both', bottom=False, top=False, left=False, right=False,
                    labelbottom=False, labelleft=False)

digits = load_digits()

# Flatten each 8x8 image into a row vector, giving one "document" per digit.
images = digits['images']
images = np.concatenate([image.reshape((1, -1)) for image in images], axis=0)

# Candidate numbers of topics: 4 through 25.
topicsRange = [i + 4 for i in range(22)]

# n_topics was renamed to n_components in later scikit-learn releases.
ldaModels = [LDA(n_components=numTopics) for numTopics in topicsRange]

for lda in ldaModels:
    lda.fit(images)

# Approximate log likelihood of the data under each model.
scores = [lda.score(images) for lda in ldaModels]

plt.plot(topicsRange, scores)
plt.show()

# np.argmax returns an index into scores; map it back to a number of topics.
maxLogLikelihoodTopicsNumber = topicsRange[int(np.argmax(scores))]
plotNumbers = [4, 9, 16, 25]

if maxLogLikelihoodTopicsNumber not in plotNumbers:
    plotNumbers.append(maxLogLikelihoodTopicsNumber)

for numberOfTopics in plotNumbers:
    plt.figure()
    modelIdx = topicsRange.index(numberOfTopics)
    lda = ldaModels[modelIdx]
    sideLen = int(np.ceil(np.sqrt(numberOfTopics)))
    # Plot each topic's pixel distribution as an 8x8 image.
    for topicIdx, topic in enumerate(lda.components_):
        ax = plt.subplot(sideLen, sideLen, topicIdx + 1)
        ax.imshow(topic.reshape((8, 8)), cmap=plt.cm.gray_r)
        untick(ax)
    plt.show()