Solved – Inferring the number of topics for gensim’s LDA – perplexity, CM, AIC, and BIC

aic, latent-class, latent-dirichlet-alloc, perplexity, topic-models

I am confused about how to interpret the fluctuations in LDA perplexity across different numbers of topics when trying to determine the best number of topics. Additionally, I would like to know how to implement AIC/BIC with gensim LDA models.

I am importing the 20 newsgroups dataset from sklearn:

from sklearn.datasets import fetch_20newsgroups

Metadata were removed as per the sklearn recommendation, and the data were split into train and test sets using sklearn as well (the subset parameter).
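For reference, a minimal sketch of that loading and splitting step (the variable names are placeholders of mine):

from sklearn.datasets import fetch_20newsgroups

# Strip headers, footers and quotes as sklearn recommends,
# and use the built-in train/test split via the subset parameter.
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))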

I trained 35 LDA models with different values of k, the number of topics, ranging from 1 to 100, using the train subset of the data. Afterwards, I estimated the per-word perplexity of each model on the held-out test corpus using gensim's multicore LDA log_perplexity function:

DLM_testCorpusBoW = [DLM_fullDict.doc2bow(tstD) for tstD in testData]
PerWordPP = modelLDA.log_perplexity(DLM_testCorpusBoW)
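For completeness, here is a minimal sketch of how such a dictionary and model sweep can be built with gensim; trainData, candidate_ks, and the LdaMulticore settings below are placeholder assumptions, not the exact setup I used:

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# Build the dictionary and BoW corpus from the tokenised training documents,
# then train one model per candidate number of topics k.
DLM_fullDict = Dictionary(trainData)
DLM_trainCorpusBoW = [DLM_fullDict.doc2bow(trD) for trD in trainData]

bounds = {}
for k in candidate_ks:  # e.g. 35 values spread between 1 and 100
    modelLDA = LdaMulticore(corpus=DLM_trainCorpusBoW, id2word=DLM_fullDict,
                            num_topics=k, passes=10, workers=3)
    bounds[k] = modelLDA.log_perplexity(DLM_testCorpusBoW)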

Eventually, keeping in mind that the true k is 20 for this dataset, the perplexity figures were startlingly negative:

[Figure: per-word log_perplexity output for the 35 models against the number of topics k]

Additionally, I implemented Topic Coherence Models, and the results weren't very informative, with a lot of fluctuation:

[Figures: topic coherence scores against the number of topics k]
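For reference, coherence scores like these can be computed with gensim's CoherenceModel; this is a minimal sketch assuming the modelLDA, dictionary, and training corpus from above (UMass shown; c_v would instead need the raw tokenised texts):

from gensim.models import CoherenceModel

# UMass coherence for one trained model, evaluated on the BoW corpus.
cm = CoherenceModel(model=modelLDA, corpus=DLM_trainCorpusBoW,
                    dictionary=DLM_fullDict, coherence='u_mass')
umass_score = cm.get_coherence()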

The negative values are apparently due to infinitesimal probabilities being converted to the log scale automatically by gensim. But even though a lower perplexity is desired, a lower bound value denotes deterioration (according to this), and in my figures this bound keeps deteriorating as the number of topics grows, yet we expect the perplexity to improve with a bigger k (here).

So, do the figures seem logical?
I would also like to implement AIC and BIC, but for that I need the SSE of the models. How can I get it from gensim, or are these measures already implemented somewhere?

Best Answer

Counterintuitively, it turns out that the log_perplexity function doesn't output a $perplexity$ after all (the documentation of the function wasn't clear enough for me personally), but a per-word likelihood $bound$, which must then be plugged into the perplexity's lower-bound equation (taken from this paper, Online Learning for Latent Dirichlet Allocation by Hoffman, Blei and Bach): $$ \operatorname{perplexity}(n^{test}, \lambda, \alpha) \leq \exp\left\{ -\left(\sum_i \mathbb{E}_q\left[\log p(n_i^{test}, \theta_i, z_i \mid \alpha, \beta)\right] - \mathbb{E}_q\left[\log q(\theta_i, z_i)\right]\right) \Big/ \left(\sum_{i,w} n_{iw}^{test}\right) \right\} $$

Viz.,

$$ \operatorname{perplexity}(n^{test}, \lambda, \alpha) \leq e^{-\text{bound}} $$

Some people like to use $2$ instead of $e$ in the equation above.
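In code, the conversion is simply this (a sketch assuming the modelLDA and held-out corpus from the question):

import numpy as np

# log_perplexity returns the per-word likelihood bound (in nats),
# so convert it to an actual perplexity:
bound = modelLDA.log_perplexity(DLM_testCorpusBoW)
perplexity = np.exp(-bound)      # base e
# perplexity = 2 ** (-bound)     # some people prefer base 2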

For calculating $AIC$ and $BIC$ one usually needs the model's likelihood (typically its maximised log-likelihood), not necessarily the $SSE$, especially in a topic-modelling setting.
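If you still want rough $AIC$/$BIC$ numbers from gensim output, one heavily assumption-laden sketch is to treat the per-word bound times the number of held-out tokens as the log-likelihood, and to count the free parameters as num_topics times the vocabulary size; both choices are debatable, so treat the result as a relative comparison only:

import numpy as np

# Rough sketch: approximate log p(test) by the per-word bound times the
# number of held-out tokens, and count LDA's free parameters crudely as
# num_topics * vocab_size. n_tokens plays the role of "n" in BIC.
n_tokens = sum(cnt for doc in DLM_testCorpusBoW for _, cnt in doc)
log_lik = bound * n_tokens
n_params = modelLDA.num_topics * len(DLM_fullDict)
AIC = 2 * n_params - 2 * log_lik
BIC = n_params * np.log(n_tokens) - 2 * log_lik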

Finally, as for the UMass coherence measure: to the best of my knowledge it hasn't been used in a model-selection scenario with LDA yet, but the sharp dip I got at $k=20$ (the proper number of topics for the 20 newsgroups dataset) is encouraging. However, topic coherence measures should optimally be close to zero, so that sharp dip isn't an improvement but rather a deterioration in the coherence (the meaningfulness or interpretability) of the topics.
