I have a collection of documents and want to detect a hierarchy of named topics from them, what are the pros/cons for using hierarchical latent Dirichlet allocation (h-LDA) over hierarchical agglomerative clustering (HAC)?
Solved – How does Hierarchical LDA compare to Hierarchical Agglomerative Clustering
Tags: hierarchical-clustering, hierarchical-bayesian, machine-learning, topic-models
Related Solutions
What LDA does, and what it can answer
Consider this snippet from the paper introducing supervised LDA:
Most topic models, such as latent Dirichlet allocation (LDA), are unsupervised: only the words in the documents are modeled. The goal is to infer topics that maximize the likelihood (or the posterior probability) of the collection.
In other words, for a given corpus and trained LDA model of fixed $k$, that's all you get: The latent topics that maximize the posterior probability of the observed corpus.
Now, that's not to say that a domain subject-matter expert couldn't make some intuitive guesses in the right direction. Take a look at these topics from an LDA model trained with $k = 16$ on the handwritten digits data that ships with sklearn:

[Image: the 16 learned topics rendered as 8×8 grayscale digit images]
Some are entirely recognizable as digits; others we're left to speculate about or analyze further, maybe "half a nine" or "one common way of writing a seven." (See the code below, which produces this and a few other plots for varied numbers of topics.)
How many topics, via hierarchical topic models
Above, our choice of $k$ was taken from a quick look through an arbitrary space of possible parameters. This was straightforward since we rather expect that the number of meaningful topics won't be too far removed from ten, the number of digits.
In your case, there's no mention of prior knowledge that justifies either a chosen $k$, or even a subspace to search. Hierarchical topic models can handle this in a principled fashion, by employing Dirichlet processes. (Loosely, DPs can be thought of as an infinite-dimensional generalization of the Dirichlet distribution.) Empirically, this approach has been shown to choose a $k$ similar to the LDA model that minimizes perplexity.
Though hierarchical topic models can handle a single-layer hierarchy, they were motivated by more elaborate models of dependency within and between groups, which may interest you:
We assume that the data are subdivided into a set of groups, and that within each group we wish to find clusters that capture latent structure in the data assigned to that group. The number of clusters within each group is unknown and is to be inferred. Moreover, in a sense that we make precise, we wish to allow clusters to be shared among the groups.
They go on further to detail an example of likely interest:
An example of the kind of problem that motivates us can be found in genetics. Consider a set of k binary markers (e.g., single nucleotide polymorphisms or “SNPs”) in a localized region of the human genome. While an individual human could exhibit any of 2 k different patterns of markers on a single chromosome, in real populations only a small subset of such patterns—haplotypes—are actually observed (Gabriel et al. 2002). [...] Now consider an extension of this problem in which the population is divided into a set of groups; e.g., African, Asian and European subpopulations. We may not only want to discover the sets of haplotypes within each subpopulation, but we may also wish to discover which haplotypes are shared between subpopulations. The identification of such haplotypes would have significant implications for the understanding of the migration patterns of ancestral populations of humans.
So, you can use hierarchical models simply to choose the number of topics, or to model much more elaborate group relationships. (I've not the slightest bioinformatics expertise, so I can't even begin to suggest what would be useful or appropriate, but I hope the details in the paper can help guide you.)
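To give a feel for how Dirichlet processes let the data choose an effective number of components, here is a minimal numpy sketch of the stick-breaking construction (the values and the 1% mass cutoff are illustrative assumptions, not part of any paper's method):

import numpy as np

# Stick-breaking view of a Dirichlet process: component weights are made
# by repeatedly breaking off a Beta(1, alpha) fraction of the remaining
# stick. Smaller alpha concentrates mass on fewer components, which is
# how DP-based topic models infer an effective number of topics.
def stick_breaking(alpha, n_components, rng):
    remaining = 1.0
    weights = []
    for _ in range(n_components):
        frac = rng.beta(1.0, alpha)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    return np.array(weights)

rng = np.random.default_rng(0)
weights = stick_breaking(alpha=2.0, n_components=50, rng=rng)
# Count components carrying non-negligible mass (an arbitrary 1% cutoff)
effective_k = int(np.sum(weights > 0.01))

Even though we truncated at 50 components, only a handful end up with appreciable weight; that handful plays the role of the "chosen" $k$.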
What the topics mean, via sLDA
Finally, if your data includes response variables you'd like to predict, e.g. the diseases or genetic disorders you mention, then supervised LDA is probably what you're looking for. From the paper linked above, emphasis mine:
In supervised latent Dirichlet allocation (sLDA), we add to LDA a response variable associated with each document. As mentioned, this variable might be the number of stars given to a movie, a count of the users in an on-line community who marked an article interesting, or the category of a document. We jointly model the documents and the responses, in order to find latent topics that will best predict the response variables for future unlabeled documents.
A brief aside: Cited in the sLDA paper is this one, which may be of interest:
P. Flaherty, G. Giaever, J. Kumm, M. Jordan, and A. Arkin. A latent variable model for chemogenomic profiling. Bioinformatics, 21(15):3286–3293, 2005.
Code
# -*- coding: utf-8 -*-
"""
Created on Mon Apr 18 18:24:41 2016

@author: SeanEaster
"""

from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import numpy as np


def untick(sub):
    """Hide all ticks and tick labels on a subplot."""
    sub.tick_params(which='both', bottom=False, top=False, left=False,
                    right=False, labelbottom=False, labelleft=False)


digits = load_digits()
images = digits['images']
images = [image.reshape((1, -1)) for image in images]
images = np.concatenate(tuple(images), axis=0)

# Fit one LDA model for each candidate number of topics, 4 through 25
topicsRange = [i + 4 for i in range(22)]
ldaModels = [LDA(n_components=numTopics) for numTopics in topicsRange]  # n_topics in older sklearn

for lda in ldaModels:
    lda.fit(images)

# Plot the approximate log likelihood against the number of topics
scores = [lda.score(images) for lda in ldaModels]
plt.plot(topicsRange, scores)
plt.show()

# np.argmax gives an index into scores; map it back to a topic count
maxLogLikelihoodTopicsNumber = topicsRange[int(np.argmax(scores))]

plotNumbers = [4, 9, 16, 25]
if maxLogLikelihoodTopicsNumber not in plotNumbers:
    plotNumbers.append(maxLogLikelihoodTopicsNumber)

# Render each model's topic-word distributions as 8x8 images
for numberOfTopics in plotNumbers:
    plt.figure()
    modelIdx = topicsRange.index(numberOfTopics)
    lda = ldaModels[modelIdx]
    sideLen = int(np.ceil(np.sqrt(numberOfTopics)))
    for topicIdx, topic in enumerate(lda.components_):
        ax = plt.subplot(sideLen, sideLen, topicIdx + 1)
        ax.imshow(topic.reshape((8, 8)), cmap=plt.cm.gray_r)
        untick(ax)
    plt.show()
You can use a sentence splitter to split your document into sentences. I have never used the approach myself, but the tool is available with the OpenNLP package in R, Python, and RapidMiner.
What you could also do is train a topic model on a corpus with clearly defined topics. Then apply the same model to your one document and see how its topic structure turns out.
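A minimal sklearn sketch of that second suggestion, under assumed toy data (the tiny corpus and n_components=2 are illustrative only): fit LDA on a corpus with roughly known topics, then call transform() on a new document to see how its mass spreads over the learned topics.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: two sports documents, two politics documents
corpus = [
    "goal striker football match league",
    "match striker goal penalty football",
    "parliament vote election policy minister",
    "election vote policy parliament debate",
]
vec = CountVectorizer()
X = vec.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Project a held-out document onto the trained topics
new_doc = vec.transform(["the striker scored a goal in the match"])
topic_dist = lda.transform(new_doc)[0]  # one row: P(topic | document)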
Best Answer
The most important difference for your task, I think, is that h-LDA is the only one of the two methods that produces a hierarchy of topics; hierarchical agglomerative clustering (HAC) returns a hierarchy of documents (see here). Sure, one might reason that HAC-grouped documents share topics of varying specificity, and that topics can be discerned from the groups, but HAC doesn't formally model them.
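To make that contrast concrete, here is a minimal scipy sketch: HAC on document vectors yields a merge tree over the documents themselves, with no word distributions attached. The bag-of-words matrix is a made-up toy.

import numpy as np
from scipy.cluster.hierarchy import linkage

# Each row is one document's word counts over a 4-word vocabulary
docs = np.array([
    [2, 1, 0, 0],   # doc 0: mostly "topic A" words
    [3, 0, 0, 1],   # doc 1
    [0, 0, 2, 2],   # doc 2: mostly "topic B" words
    [0, 1, 3, 2],   # doc 3
], dtype=float)

# Ward linkage: Z has one row per merge (cluster_i, cluster_j, distance, size)
Z = linkage(docs, method="ward")
# Z describes which documents merge and when; nothing in it is a "topic".

The output is a hierarchy of document clusters, which is exactly what the answer means: you can read topics into the groupings, but the model never represents them.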
The distributions over words in h-LDA, on the other hand, have a straightforward interpretation as topics of greater and greater specificity, as shown in the example from the h-LDA paper (below).
HAC has no probabilistic interpretation; it relies solely on a similarity function. Hierarchical LDA does, and can be used to infer the topic distribution of a new document.
HAC is a deterministic algorithm, and will return the same hierarchy given the same input; h-LDA's inference can converge to a local optimum, so its results can vary from run to run.
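That run-to-run variability is easy to demonstrate with sklearn's (flat) LDA on assumed toy data: different random seeds can yield different topic-word matrices, while fixing the seed makes the fit repeatable.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Toy word-count matrix: 20 documents over a 10-word vocabulary
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(20, 10))

a = LDA(n_components=3, random_state=0).fit(X).components_
b = LDA(n_components=3, random_state=1).fit(X).components_
same_seed = LDA(n_components=3, random_state=0).fit(X).components_

seeds_differ = not np.allclose(a, b)          # usually True: local optima vary
same_seed_agrees = np.allclose(a, same_seed)  # True: the seed fixes the run

A deterministic procedure like HAC needs no such seed; run it twice on the same input and the hierarchies agree exactly.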