Solved – How does Hierarchical LDA compare to Hierarchical Agglomerative Clustering

hierarchical clustering, hierarchical-bayesian, machine learning, topic-models

I have a collection of documents and want to detect a hierarchy of named topics in them. What are the pros and cons of using hierarchical latent Dirichlet allocation (h-LDA) rather than hierarchical agglomerative clustering (HAC)?

Best Answer

The most important difference for your task, I think, is that of the two methods only h-LDA creates a hierarchy of topics; hierarchical agglomerative clustering (HAC) returns a hierarchy of documents. Sure, one can reason that documents grouped together by HAC share topics of varying specificity, and try to discern topics from those groups. But HAC doesn't formally model topics.
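To illustrate that HAC merges documents rather than topics, here is a minimal sketch in pure Python. The bag-of-words vectors, cosine distance, average linkage, and the four toy documents are all my own assumptions for illustration; a real pipeline would more likely use a library implementation and better features.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def bag_of_words(text):
    """Word counts for a whitespace-tokenized, lowercased document."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def hac_average_linkage(docs):
    """Agglomerate documents bottom-up.

    Returns the merges in order as (cluster_a, cluster_b, distance),
    where each cluster is a tuple of document indices.
    """
    vectors = [bag_of_words(d) for d in docs]
    dist = {
        (i, j): cosine_distance(vectors[i], vectors[j])
        for i, j in combinations(range(len(docs)), 2)
    }

    def linkage(pair):
        # Average pairwise distance between the members of two clusters.
        a, b = pair
        total = sum(dist[min(i, j), max(i, j)] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [(i,) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        # Merge the closest pair; ties go to the first pair encountered.
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return merges


# Hypothetical toy corpus.
docs = [
    "cats purr and cats sleep",
    "cats sleep all day",
    "stocks fell as markets slid",
    "markets slid and stocks fell again",
]

merges = hac_average_linkage(docs)
for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d:.3f}")
```

Every node in the resulting tree is a set of document indices. Nothing in the output is a distribution over words, so any topic label for a cluster has to be derived afterwards, for example from the most frequent words in its documents.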

The distributions over words in h-LDA, on the other hand, have a straightforward interpretation as topics of greater and greater specificity, as shown in the example from the h-LDA paper (below).

HAC has no probabilistic interpretation; it relies solely on a similarity function. Hierarchical LDA is a probabilistic model, and can be used to infer the topic distribution of a new document.
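To make the probabilistic interpretation concrete, here is a sketch of the word likelihood under the h-LDA generative model as I understand it. The notation is my own, not the paper's: a document follows one root-to-leaf path $c = (c_1, \dots, c_L)$ through the topic tree, $\theta_\ell$ is the document's proportion of words drawn from level $\ell$, and $\beta_{k,w}$ is the probability of word $w$ under the topic at node $k$.

```latex
$$
p(w \mid c, \theta) \;=\; \sum_{\ell=1}^{L} \theta_\ell \, \beta_{c_\ell,\, w}
$$
```

For a new document, inference estimates a posterior over the path $c$ and the level proportions $\theta$ given its words. HAC has no counterpart to this; a new document can only be attached to the tree through the similarity function.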

HAC is a deterministic algorithm and will return the same hierarchy given the same input; h-LDA inference can settle in a local optimum, meaning that its results can vary from run to run.

[Image: example topic hierarchy from the h-LDA paper]