Solved – What’s the relation between Matrix Factorization (MF) and Latent Dirichlet Allocation (LDA)

latent-dirichlet-alloc, machine-learning, matrix-decomposition, text-mining

My understanding is that both MF and LDA can be applied to document classification. I will first summarize my understanding of these two methods before asking my questions.

Assume we use a matrix $X$ to summarize the documents in a corpus and the words in the vocabulary, where

$X$ is a $W \times D$ matrix,

$X_{ji}$ represents word-j's count in doc-i,

$D$ is the number of documents in the corpus, and

$W$ is the number of words in the vocabulary.
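For concreteness, here is a minimal sketch of building such a matrix with scikit-learn's CountVectorizer (the three-document corpus is made up); note that CountVectorizer puts documents in rows, so I transpose to get the $W \times D$ orientation above:

```python
# Sketch: build a W x D word-document count matrix from a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vec = CountVectorizer()
X_dw = vec.fit_transform(docs)   # shape (D, W): documents as rows
X = X_dw.T.toarray()             # transpose to W x D as in the question
W, D = X.shape
print(W, D)                      # vocabulary size, number of documents
```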

1. Matrix Factorization

(Figure: the singular value decomposition $X = U \Sigma V^T$.)

In its economical form,

$U$ is a $W \times K$ matrix,

$\Sigma$ a $K \times K$ matrix,

$V$ a $D \times K$ matrix

If we write $\Sigma \equiv S^2$, so that $S$ is the diagonal matrix of square roots of the singular values, we can factorize $X$ into two matrices:

$X = U \Sigma V^T = U S^2 V^T = (US)(SV^T) = (US)(VS)^T \equiv A B^T$

where $A \equiv US$ and $B \equiv VS$; the last step uses $S^T = S$, since $S$ is diagonal.
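A small NumPy sketch of this split (a random Poisson matrix stands in for real counts, and $K = 5$ is an arbitrary choice):

```python
# Sketch: truncated SVD of X, split into A = U S and B = V S so X ~ A B^T.
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(50, 20)).astype(float)  # toy W x D count matrix

K = 5
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
U_k, sigma_k, V_k = U[:, :K], sigma[:K], Vt[:K].T

S = np.diag(np.sqrt(sigma_k))        # S with S^2 = Sigma (truncated)
A = U_k @ S                          # W x K word-specific features
B = V_k @ S                          # D x K document-specific features

print(np.linalg.norm(X - A @ B.T))   # rank-K approximation error
```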

2. Latent Dirichlet Allocation

(Figure: plate diagram of the LDA generative model.)

Assume each document is a mixture of $\mathbb{K}$ topics, and each topic has its own distribution over words. Each word in a doc is drawn from one of those topics.

\begin{align}
\pi_i \mid \alpha &\sim \mathrm{Dir}(\alpha \mathbb{1}_\mathbb{K}) \\
q_{il} \mid \pi_i &\sim \mathrm{Cat}(\pi_i) \\
b_k \mid \gamma &\sim \mathrm{Dir}(\gamma \mathbb{1}_W) \\
y_{il} \mid q_{il} = k,\ b_k &\sim \mathrm{Cat}(b_k)
\end{align}

where $\pi_i$ is doc-i's topic distribution,

$\alpha$ the user-specified parameter for the topic distribution (Dirichlet),

$\gamma$ the user-specified parameter for the topic-specific word distribution (also Dirichlet),

$q_{il} \in \{1, 2, \dots, \mathbb{K}\}$ the topic for the $l^{th}$ word in doc-i,

$y_{il} \in \{1, 2, \dots, W \} $ the identity for the $l^{th}$ word in doc-i,

$b_k$ the word distribution for topic $k$, for $k=1, \dots, \mathbb{K}$, and

$L_i$, in the plate notation, the length of doc-i.
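To make the generative story concrete, here is a minimal simulation of it in NumPy (all sizes and hyperparameter values are arbitrary choices for illustration):

```python
# Sketch: simulate the LDA generative process described above.
import numpy as np

rng = np.random.default_rng(0)
K, W, D = 3, 100, 5          # topics, vocabulary size, documents
alpha, gamma = 0.5, 0.1      # Dirichlet hyperparameters
L = 50                       # doc length (L_i = L for all i here)

b = rng.dirichlet(gamma * np.ones(W), size=K)   # topic-word distributions b_k

docs = []
for i in range(D):
    pi = rng.dirichlet(alpha * np.ones(K))      # pi_i | alpha ~ Dir
    q = rng.choice(K, size=L, p=pi)             # q_il | pi_i ~ Cat(pi_i)
    y = np.array([rng.choice(W, p=b[k]) for k in q])  # y_il | q_il, b_k
    docs.append(y)

print(docs[0][:10])          # first 10 word ids of doc 0
```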

Summary

In MF, we factorize $X$ into two matrices, $A$ and $B$. We can interpret one matrix ($A$, which is $W \times K$) as word-specific features and the other ($B$, which is $D \times K$) as document-specific features.

In LDA, we use the data to infer two distribution matrices:

$
\Pi = \big(\pi_1, \pi_2, \dots, \pi_D\big),
$

which is a $\mathbb{K} \times D$ matrix, with each column being the distribution over topics for a given doc. And

$
\mathcal{B} = \big(b_1, b_2, \dots, b_\mathbb{K} \big),
$

which is a $W \times \mathbb{K}$ matrix with each column being the topic-specific distribution over words in the vocabulary.
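For reference, here is a small sketch of multiplying the two: the product $\mathcal{B}\Pi$ is a $W \times D$ matrix whose $(j, i)$ entry is $\sum_k b_k(j)\,\pi_i(k)$, the probability of word-j in doc-i, so scaling column $i$ by the doc length $L_i$ gives expected counts (random Dirichlet draws stand in for fitted parameters):

```python
# Sketch: expected counts E[X_ji] = L_i * (B @ Pi)[j, i].
import numpy as np

rng = np.random.default_rng(0)
K, W, D = 3, 100, 5

Pi = rng.dirichlet(np.ones(K), size=D).T   # K x D, columns sum to 1
B = rng.dirichlet(np.ones(W), size=K).T    # W x K, columns sum to 1

P = B @ Pi                                 # W x D word probabilities per doc
L = np.full(D, 50)                         # document lengths
X_expected = P * L                         # expected count matrix E[X]
print(X_expected.sum(axis=0))              # each column sums to L_i
```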

Questions

  1. Is there a way to formally interpret LDA as a form of matrix factorization? Namely, does it make sense to write $X$ in terms of the product of $\mathcal{B}$ and $\Pi$?

  2. A common application of MF is to find the items that are similar to a target item. If there is indeed a correspondence between MF and LDA, how do I perform this kind of similarity search with an LDA model? (A sketch of what I mean by similarity search follows below.)
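To make the second question concrete, in MF I would rank items by cosine similarity of their factor vectors; below is the analogous sketch using the columns of $\Pi$ as document representations (`most_similar` is a hypothetical helper of mine, and $\Pi$ is random here):

```python
# Sketch: rank documents by cosine similarity of their topic columns in Pi.
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 10
Pi = rng.dirichlet(np.ones(K), size=D).T     # K x D topic proportions

def most_similar(i, Pi, topn=3):
    """Return the topn documents most similar to document i (cosine)."""
    V = Pi / np.linalg.norm(Pi, axis=0)      # normalize each column
    sims = V.T @ V[:, i]                     # cosine similarities to doc i
    order = np.argsort(-sims)                # descending similarity
    return [j for j in order if j != i][:topn]

print(most_similar(0, Pi))
```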

Best Answer

This paper suggests an answer:

Faleiros, Thiago de Paulo, and Alneu de Andrade Lopes. "On the equivalence between algorithms for non-negative matrix factorization and latent Dirichlet allocation." European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, XXIV. European Neural Network Society-ENNS, 2016. (PDF link)
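As a rough empirical illustration of that correspondence (a sketch of my own, not code from the paper): fitting scikit-learn's NMF with Kullback-Leibler loss and its LatentDirichletAllocation on the same count matrix and comparing the top words per component tends to recover similar topics. The corpus size, vocabulary size, and $K$ below are arbitrary choices.

```python
# Sketch: compare KL-divergence NMF components with LDA topics on one corpus.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:500]
vec = CountVectorizer(max_features=1000, stop_words="english")
X = vec.fit_transform(docs)           # D x W counts (sklearn's layout)
vocab = vec.get_feature_names_out()

K = 5
nmf = NMF(n_components=K, beta_loss="kullback-leibler",
          solver="mu", max_iter=300, random_state=0).fit(X)
lda = LatentDirichletAllocation(n_components=K, random_state=0).fit(X)

# Print the top-5 words per NMF component and per LDA topic.
for name, comp in [("NMF", nmf.components_), ("LDA", lda.components_)]:
    for k, row in enumerate(comp):
        print(name, k, " ".join(vocab[row.argsort()[-5:][::-1]]))
```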