Solved – Inferring the number of topics for gensim’s LDA – perplexity, CM, AIC, and BIC

aic, latent-class, latent-dirichlet-alloc, perplexity, topic-models

I am confused about how to interpret the fluctuations in LDA perplexity across different numbers of topics when trying to determine the best number of topics. Additionally, I would like to know how to implement AIC/BIC with gensim LDA models.

I am importing the 20 newsgroups dataset from sklearn:

from sklearn.datasets import fetch_20newsgroups

Metadata were removed as per the sklearn recommendation, and the data were split into train and test sets using sklearn as well (the subset parameter).
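For reference, a minimal sketch of that loading and splitting step (the variable names are placeholders of mine):

from sklearn.datasets import fetch_20newsgroups

# Strip headers, footers and quotes as sklearn recommends,
# and use the built-in train/test split via the subset parameter.
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))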

I trained 35 LDA models with different values of k, the number of topics, ranging from 1 to 100, using the train subset of the data. Afterwards, I estimated the per-word perplexity of each model on the held-out test corpus using gensim's multicore LDA log_perplexity function:

DLM_testCorpusBoW = [DLM_fullDict.doc2bow(tstD) for tstD in testData]
PerWordPP = modelLDA.log_perplexity(DLM_testCorpusBoW)
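For completeness, here is a minimal sketch of how such a dictionary and model sweep can be built with gensim; trainData, candidate_ks, and the LdaMulticore settings below are placeholder assumptions, not the exact setup I used:

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# Build the dictionary and BoW corpus from the tokenised training documents,
# then train one model per candidate number of topics k.
DLM_fullDict = Dictionary(trainData)
DLM_trainCorpusBoW = [DLM_fullDict.doc2bow(trD) for trD in trainData]

bounds = {}
for k in candidate_ks:  # e.g. 35 values spread between 1 and 100
    modelLDA = LdaMulticore(corpus=DLM_trainCorpusBoW, id2word=DLM_fullDict,
                            num_topics=k, passes=10, workers=3)
    bounds[k] = modelLDA.log_perplexity(DLM_testCorpusBoW)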

Eventually, keeping in mind that the true k is 20 for this dataset, the perplexity figures were startlingly negative:

[Figure: per-word log_perplexity output for the 35 models against the number of topics k]

Additionally, I implemented Topic Coherence Models, and the results weren't very informative, with a lot of fluctuation:

[Figures: topic coherence scores against the number of topics k]
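For reference, coherence scores like these can be computed with gensim's CoherenceModel; this is a minimal sketch assuming the modelLDA, dictionary, and training corpus from above (UMass shown; c_v would instead need the raw tokenised texts):

from gensim.models import CoherenceModel

# UMass coherence for one trained model, evaluated on the BoW corpus.
cm = CoherenceModel(model=modelLDA, corpus=DLM_trainCorpusBoW,
                    dictionary=DLM_fullDict, coherence='u_mass')
umass_score = cm.get_coherence()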

The negative values are apparently due to infinitesimal probabilities being converted to the log scale automatically by gensim. But even though a lower perplexity is desired, a lower bound value denotes deterioration (according to this), and in my figures this bound keeps deteriorating as the number of topics grows, yet we expect the perplexity to improve with a bigger k (here).

So, do the figures seem logical?
I would also like to implement AIC and BIC, but for that I need the SSE of the models. How can I get it from gensim, or are these measures already implemented somewhere?

Best Answer

Counterintuitively, it turns out that the log_perplexity function doesn't output a $perplexity$ after all (the documentation of the function wasn't clear enough for me personally), but a per-word likelihood $bound$, which must then be plugged into the perplexity's lower-bound equation (taken from this paper, Online Learning for Latent Dirichlet Allocation by Hoffman, Blei and Bach): $$ \operatorname{perplexity}(n^{test}, \lambda, \alpha) \leq \exp\left\{ -\left(\sum_i \mathbb{E}_q\left[\log p(n_i^{test}, \theta_i, z_i \mid \alpha, \beta)\right] - \mathbb{E}_q\left[\log q(\theta_i, z_i)\right]\right) \Big/ \left(\sum_{i,w} n_{iw}^{test}\right) \right\} $$

Viz.,

$$ \operatorname{perplexity}(n^{test}, \lambda, \alpha) \leq e^{-\text{bound}} $$

Some people like to use $2$ instead of $e$ in the equation above.
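In code, the conversion is simply this (a sketch assuming the modelLDA and held-out corpus from the question):

import numpy as np

# log_perplexity returns the per-word likelihood bound (in nats),
# so convert it to an actual perplexity:
bound = modelLDA.log_perplexity(DLM_testCorpusBoW)
perplexity = np.exp(-bound)      # base e
# perplexity = 2 ** (-bound)     # some people prefer base 2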

For calculating $AIC$ and $BIC$ one usually needs the model's likelihood (typically its maximised log-likelihood), not necessarily the $SSE$, especially in a topic-modelling setting.
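If you still want rough $AIC$/$BIC$ numbers from gensim output, one heavily assumption-laden sketch is to treat the per-word bound times the number of held-out tokens as the log-likelihood, and to count the free parameters as num_topics times the vocabulary size; both choices are debatable, so treat the result as a relative comparison only:

import numpy as np

# Rough sketch: approximate log p(test) by the per-word bound times the
# number of held-out tokens, and count LDA's free parameters crudely as
# num_topics * vocab_size. n_tokens plays the role of "n" in BIC.
n_tokens = sum(cnt for doc in DLM_testCorpusBoW for _, cnt in doc)
log_lik = bound * n_tokens
n_params = modelLDA.num_topics * len(DLM_fullDict)
AIC = 2 * n_params - 2 * log_lik
BIC = n_params * np.log(n_tokens) - 2 * log_lik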

Finally, as for the UMass coherence measure: to the best of my knowledge it hasn't been used in a model-selection scenario with LDA yet, but the sharp dip I got at $k=20$ (the proper number of topics for the 20 newsgroups dataset) is encouraging. However, topic coherence measures should optimally be close to zero, so that sharp dip isn't an improvement but rather a deterioration in the coherence (the meaningfulness or interpretability) of the topics.
