Solved – Latent Dirichlet Allocation vs. pLSA

dirichlet-distribution, latent-semantic-analysis, latent-semantic-indexing, latent-variable, overfitting

In the original LDA paper it is stated that:

The parameters for a $k$-topic pLSI model are $k$ multinomial distributions of size $V$ and $M$ mixtures over the $k$ hidden topics. This gives $kV + kM$ parameters and therefore linear growth in $M$. The linear growth in parameters suggests that the model is prone to overfitting and, empirically, overfitting is indeed a serious problem[.]

Also:

LDA is a well-defined generative model and generalizes easily to new documents. Furthermore, the $k + kV$ parameters in a $k$-topic LDA model do not grow with the size of the training corpus.

But as I understand it, LDA also involves those $kV + kM$ quantities, just not as free model parameters, so the raw parameter count seems irrelevant to overfitting. That is, in pLSA the following posteriors must be estimated ($M$ is the number of documents):

$p(z|d): kM$ parameters,

$p(w|z): kV$ parameters,

and in LDA the following posteriors have to be estimated:

$p(\Theta_d|\alpha): kM$ parameters ($\Theta_d$ is $k$-dimensional),

$p(w|z): kV$ parameters,

plus the two hyperparameters $\alpha$ and $\eta$.
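
To make the counting argument concrete, here is a minimal Python sketch of the two counts (the values of $k$, $V$, and $M$ below are arbitrary, chosen only for illustration):

```python
# Free-parameter counts for pLSA vs. LDA, as quoted above.
# k = number of topics, V = vocabulary size, M = number of documents.

def plsa_params(k: int, V: int, M: int) -> int:
    # pLSA fits p(w|z) (k*V values) plus a separate mixture p(z|d)
    # for every training document (k*M values): linear growth in M.
    return k * V + k * M

def lda_params(k: int, V: int) -> int:
    # LDA's corpus-level parameters are the k-dimensional Dirichlet
    # prior alpha and the k topic-word distributions; each theta_d is
    # a latent variable that is integrated out, not a fitted parameter.
    return k + k * V

for M in (1_000, 10_000, 100_000):
    print(f"M={M}: pLSA={plsa_params(100, 10_000, M)}, "
          f"LDA={lda_params(100, 10_000)}")
```

Only the pLSA count scales with $M$; the LDA count stays constant, which is exactly the paper's point.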

Thus, the number of posteriors to be estimated is roughly the same. Why, then, is LDA claimed to solve the overfitting problem of pLSA? I agree that a Dirichlet with a low $\alpha$ tends to generate sparser distributions than a Dirichlet with $\alpha = 1$ (i.e., uniform, which is effectively what pLSA assumes), and this sparsity might help reduce overfitting a bit, but the number of parameters is still similar.
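
To illustrate the sparsity intuition, here is a quick NumPy sketch drawing topic proportions from a symmetric Dirichlet (the dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10  # number of topics

for alpha in (0.1, 1.0):
    # theta ~ Dirichlet(alpha * 1_k); with alpha << 1 most of the mass
    # lands on a few topics, while alpha = 1 is uniform over the simplex.
    thetas = rng.dirichlet(np.full(k, alpha), size=5)
    print(f"alpha={alpha}: largest component per draw =",
          thetas.max(axis=1).round(2))
```

With $\alpha = 0.1$ most of the probability mass falls on one or two topics; with $\alpha = 1$ the draws are much flatter.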

Best Answer

We see that pLSI describes a process for generating documents whose topic distributions $p(z|d)$ are restricted to those seen in the training collection, as opposed to generating documents with arbitrary topic proportions drawn from a prior distribution. This may not be crucial in information retrieval, where the document collection can be viewed as fixed. However, in applications such as text categorization, it is crucial to have a model flexible enough to properly handle text that has not been seen before.

Thus, in pLSI the topic proportions are a fixed set of points on the topic simplex (one point per document in the training collection), whereas LDA places a Dirichlet distribution over the entire simplex, so assigning topic proportions to a new document poses no problem.
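
As a hedged illustration of that last point, here is a minimal sketch with scikit-learn's LatentDirichletAllocation (the toy corpus is made up): after fitting, an unseen document still receives its own point on the topic simplex via posterior inference, whereas pLSA defines $p(z|d)$ only for training documents and needs a folding-in heuristic for anything new.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up training corpus, for illustration only.
train_docs = [
    "topic models learn word distributions",
    "dirichlet priors smooth topic proportions",
    "documents mix several latent topics",
]
vec = CountVectorizer()
X = vec.fit_transform(train_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# A document never seen during training still gets topic proportions
# theta_d, inferred from the posterior under the learned model.
new_doc = vec.transform(["an unseen document about latent topics"])
print(lda.transform(new_doc))  # a point on the 2-topic simplex
```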
