I am studying the Latent Dirichlet Allocation (LDA) model, and I have found some explanations around the web (for example, here on Quora.com).
In the linked examples, I can clearly see which topics the author is talking about (food and cute animals).
I understand how the model works when you have an idea of what the topics mean. But what happens when you do not know what they mean?
How could the LDA model tell you what the topics are about?
How could the LDA model tell you how many topics there are?
For example, if you're running the LDA algorithm to analyze occurrences of genes and their functions, how could the model tell you whether the topics are about diseases, metabolic pathways, genetic disorders, or any other concept that relates genes to functions?
Best Answer
What LDA does, and what it can answer
Consider this snippet from the paper introducing supervised LDA:
In other words, for a given corpus and trained LDA model of fixed $k$, that's all you get: The latent topics that maximize the posterior probability of the observed corpus.
Now, that's not to say that a domain subject-matter expert couldn't make some intuitive guesses in the right direction. Take a look at these topics from an LDA model trained with $k = 16$ on the handwritten digits data that ships with `sklearn`: some are entirely recognizable as digits; some we're left to speculate about or analyze further, maybe "half a nine" or "one common way of writing a seven." (See the code below to produce this and a few other plots with varied numbers of topics.)
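As a minimal sketch of the kind of model described above (not the author's original plotting code), you can fit LDA with $k = 16$ to the `sklearn` digits data and normalize each row of `components_` into a topic-over-pixels distribution; reshaping a row to 8x8 is what lets you render a topic as an image:

```python
# Sketch: fit LDA with k = 16 on the sklearn handwritten-digits data and
# inspect each topic's distribution over the 64 pixel "words".
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import LatentDirichletAllocation

digits = load_digits()
X = digits.data  # shape (1797, 64); non-negative pixel intensities act as counts

lda = LatentDirichletAllocation(n_components=16, random_state=0)
lda.fit(X)

# Each row of components_ is an unnormalized topic-word (here, topic-pixel)
# distribution; normalizing rows gives proper probability distributions.
topics = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
print(topics.shape)  # (16, 64); topics[i].reshape(8, 8) can be plotted as an image
```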
How many topics, via hierarchical topic models
Above, our choice of $k$ was taken from a quick look through an arbitrary space of possible parameters. This was straightforward since we rather expect that the number of meaningful topics won't be too far removed from ten, the number of digits.
In your case, there's no mention of prior knowledge that justifies either a chosen $k$ or even a subspace to search. Hierarchical topic models can handle this in a principled fashion by employing Dirichlet processes. (Loosely, DPs can be thought of as an infinite-dimensional generalization of the Dirichlet distribution.) Empirically, this approach has been shown to choose a $k$ similar to that of the LDA model that minimizes perplexity. From the paper:
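For intuition about the perplexity baseline mentioned above, here is a sketch of the brute-force alternative the hierarchical model is compared against: fit LDA for several candidate values of $k$ and keep the one with the lowest held-out perplexity (the digits data and candidate grid are just illustrative choices):

```python
# Sketch of the perplexity-minimization baseline: sweep candidate k values
# and pick the LDA model with the lowest held-out perplexity. A hierarchical
# (DP-based) model arrives at a similar k without this sweep.
from sklearn.datasets import load_digits
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

X = load_digits().data
X_train, X_test = train_test_split(X, random_state=0)

perplexities = {}
for k in (4, 8, 16, 32):  # arbitrary illustrative grid
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    perplexities[k] = lda.perplexity(X_test)  # lower is better

best_k = min(perplexities, key=perplexities.get)
print(best_k, perplexities)
```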
Though hierarchical topic models can handle a single-layered hierarchy, they were motivated by more elaborate models of dependency within and between groups, which may interest you:
They go on further to detail an example of likely interest:
So, you can use hierarchical models simply to choose the number of topics, or to model much more elaborate group relationships. (I've not the slightest bioinformatics expertise, so I can't even begin to suggest what would be useful or appropriate, but I hope the details in the paper can help guide you.)
What the topics mean, via sLDA
Finally, if your data includes response variables you'd like to predict, e.g. the diseases or genetic disorders you mention, then supervised LDA is probably what you're looking for. From the paper linked above, emphasis mine:
A brief aside: Cited in the sLDA paper is this one, which may be of interest:
Code
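Regarding the sLDA suggestion above: sLDA itself isn't in `sklearn`, but a crude two-stage stand-in (not the joint model from the paper, which learns topics and response together) is to use unsupervised LDA topic proportions as features for a classifier predicting the response. Here the digit label plays the role of the response variable:

```python
# Two-stage stand-in for sLDA (an approximation, not the joint model):
# fit unsupervised LDA, then predict the response from the per-document
# topic proportions with a logistic regression.
from sklearn.datasets import load_digits
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)

pipe = make_pipeline(
    LatentDirichletAllocation(n_components=16, random_state=0),
    LogisticRegression(max_iter=1000),
)
score = cross_val_score(pipe, X, y, cv=3).mean()
print(round(score, 3))  # mean cross-validated accuracy
```

Unlike sLDA, this pipeline can't shape the topics toward the response; it only measures how predictive the unsupervised topics happen to be.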