Solved – What’s the relation between Matrix Factorization (MF) and Latent Dirichlet Allocation (LDA)

latent-dirichlet-alloc, machine-learning, matrix-decomposition, text-mining

My understanding is that both MF and LDA can be applied to document classification. I will first summarize my understanding of these two methods before asking my questions.

Assume we use a matrix $X$ to summarize the documents in a corpus and the words in the vocabulary, where

$X$ is a $W \times D$ matrix,

$X_{ji}$ represents word-j's count in doc-i,

$D$ is the number of documents in the corpus, and

$W$ is the number of words in the vocabulary.
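For concreteness, here is a minimal sketch of building such a matrix with scikit-learn's CountVectorizer (the three-document corpus is made up); note that CountVectorizer puts documents in rows, so I transpose to get the $W \times D$ orientation above:

```python
# Sketch: build a W x D word-document count matrix from a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vec = CountVectorizer()
X_dw = vec.fit_transform(docs)   # shape (D, W): documents as rows
X = X_dw.T.toarray()             # transpose to W x D as in the question
W, D = X.shape
print(W, D)                      # vocabulary size, number of documents
```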

1. Matrix Factorization

(Figure: the singular value decomposition $X = U \Sigma V^T$.)

In its economical form,

$U$ is a $W \times K$ matrix,

$\Sigma$ a $K \times K$ matrix,

$V$ a $D \times K$ matrix

If we write $\Sigma \equiv S^2$, so that $S$ is the diagonal matrix of square roots of the singular values, we can factorize $X$ into two matrices:

$X = U \Sigma V^T = U S^2 V^T = (US)(SV^T) = (US)(VS)^T \equiv A B^T$

where $A \equiv US$ and $B \equiv VS$; the last step uses $S^T = S$, since $S$ is diagonal.
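A small NumPy sketch of this split (a random Poisson matrix stands in for real counts, and $K = 5$ is an arbitrary choice):

```python
# Sketch: truncated SVD of X, split into A = U S and B = V S so X ~ A B^T.
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(50, 20)).astype(float)  # toy W x D count matrix

K = 5
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
U_k, sigma_k, V_k = U[:, :K], sigma[:K], Vt[:K].T

S = np.diag(np.sqrt(sigma_k))        # S with S^2 = Sigma (truncated)
A = U_k @ S                          # W x K word-specific features
B = V_k @ S                          # D x K document-specific features

print(np.linalg.norm(X - A @ B.T))   # rank-K approximation error
```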

2. Latent Dirichlet Allocation

(Figure: plate diagram of the LDA generative model.)

Assume each document is a mixture of $\mathbb{K}$ topics, and each topic has its own distribution over words. Each word in a doc is drawn from one of those topics.

\begin{align}
\pi_i \mid \alpha &\sim \mathrm{Dir}(\alpha \mathbb{1}_\mathbb{K}) \\
q_{il} \mid \pi_i &\sim \mathrm{Cat}(\pi_i) \\
b_k \mid \gamma &\sim \mathrm{Dir}(\gamma \mathbb{1}_W) \\
y_{il} \mid q_{il} = k,\ b_k &\sim \mathrm{Cat}(b_k)
\end{align}

where $\pi_i$ is doc-i's topic distribution,

$\alpha$ the user-specified parameter for the topic distribution (Dirichlet),

$\gamma$ the user-specified parameter for the topic-specific word distribution (also Dirichlet),

$q_{il} \in \{1, 2, \dots, \mathbb{K}\}$ the topic for the $l^{th}$ word in doc-i,

$y_{il} \in \{1, 2, \dots, W \} $ the identity for the $l^{th}$ word in doc-i,

$b_k$ the word distribution for topic $k$, for $k=1, \dots, \mathbb{K}$, and

$L_i$, in the plate notation, the length of doc-i.
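To make the generative story concrete, here is a minimal simulation of it in NumPy (all sizes and hyperparameter values are arbitrary choices for illustration):

```python
# Sketch: simulate the LDA generative process described above.
import numpy as np

rng = np.random.default_rng(0)
K, W, D = 3, 100, 5          # topics, vocabulary size, documents
alpha, gamma = 0.5, 0.1      # Dirichlet hyperparameters
L = 50                       # doc length (L_i = L for all i here)

b = rng.dirichlet(gamma * np.ones(W), size=K)   # topic-word distributions b_k

docs = []
for i in range(D):
    pi = rng.dirichlet(alpha * np.ones(K))      # pi_i | alpha ~ Dir
    q = rng.choice(K, size=L, p=pi)             # q_il | pi_i ~ Cat(pi_i)
    y = np.array([rng.choice(W, p=b[k]) for k in q])  # y_il | q_il, b_k
    docs.append(y)

print(docs[0][:10])          # first 10 word ids of doc 0
```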

Summary

In MF, we factorize $X$ into two matrices, $A$ and $B$. We can interpret one matrix ($A$, which is $W \times K$) as word-specific features and the other ($B$, which is $D \times K$) as document-specific features.

In LDA, we use the data to infer two distribution matrices:

$
\Pi = \big(\pi_1, \pi_2, \dots, \pi_D\big),
$

which is a $\mathbb{K} \times D$ matrix, with each column being the distribution over topics for a given doc. And

$
\mathcal{B} = \big(b_1, b_2, \dots, b_\mathbb{K} \big),
$

which is a $W \times \mathbb{K}$ matrix with each column being the topic-specific distribution over words in the vocabulary.
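For reference, here is a small sketch of multiplying the two: the product $\mathcal{B}\Pi$ is a $W \times D$ matrix whose $(j, i)$ entry is $\sum_k b_k(j)\,\pi_i(k)$, the probability of word-j in doc-i, so scaling column $i$ by the doc length $L_i$ gives expected counts (random Dirichlet draws stand in for fitted parameters):

```python
# Sketch: expected counts E[X_ji] = L_i * (B @ Pi)[j, i].
import numpy as np

rng = np.random.default_rng(0)
K, W, D = 3, 100, 5

Pi = rng.dirichlet(np.ones(K), size=D).T   # K x D, columns sum to 1
B = rng.dirichlet(np.ones(W), size=K).T    # W x K, columns sum to 1

P = B @ Pi                                 # W x D word probabilities per doc
L = np.full(D, 50)                         # document lengths
X_expected = P * L                         # expected count matrix E[X]
print(X_expected.sum(axis=0))              # each column sums to L_i
```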

Questions

  1. Is there a way to formally interpret LDA as a form of matrix factorization? Namely, does it make sense to write $X$ in terms of the product of $\mathcal{B}$ and $\Pi$?

  2. A common application of MF is to find the items that are similar to a target item. If there is indeed a correspondence between MF and LDA, how do I perform this kind of similarity search with an LDA model? (A sketch of what I mean by similarity search follows below.)
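To make the second question concrete, in MF I would rank items by cosine similarity of their factor vectors; below is the analogous sketch using the columns of $\Pi$ as document representations (`most_similar` is a hypothetical helper of mine, and $\Pi$ is random here):

```python
# Sketch: rank documents by cosine similarity of their topic columns in Pi.
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 10
Pi = rng.dirichlet(np.ones(K), size=D).T     # K x D topic proportions

def most_similar(i, Pi, topn=3):
    """Return the topn documents most similar to document i (cosine)."""
    V = Pi / np.linalg.norm(Pi, axis=0)      # normalize each column
    sims = V.T @ V[:, i]                     # cosine similarities to doc i
    order = np.argsort(-sims)                # descending similarity
    return [j for j in order if j != i][:topn]

print(most_similar(0, Pi))
```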

Best Answer

This paper suggests an answer:

Faleiros, Thiago de Paulo, and Alneu de Andrade Lopes. "On the equivalence between algorithms for non-negative matrix factorization and latent Dirichlet allocation." European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, XXIV. European Neural Network Society-ENNS, 2016. (PDF link)
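As a rough empirical illustration of that correspondence (a sketch of my own, not code from the paper): fitting scikit-learn's NMF with Kullback-Leibler loss and its LatentDirichletAllocation on the same count matrix and comparing the top words per component tends to recover similar topics. The corpus size, vocabulary size, and $K$ below are arbitrary choices.

```python
# Sketch: compare KL-divergence NMF components with LDA topics on one corpus.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:500]
vec = CountVectorizer(max_features=1000, stop_words="english")
X = vec.fit_transform(docs)           # D x W counts (sklearn's layout)
vocab = vec.get_feature_names_out()

K = 5
nmf = NMF(n_components=K, beta_loss="kullback-leibler",
          solver="mu", max_iter=300, random_state=0).fit(X)
lda = LatentDirichletAllocation(n_components=K, random_state=0).fit(X)

# Print the top-5 words per NMF component and per LDA topic.
for name, comp in [("NMF", nmf.components_), ("LDA", lda.components_)]:
    for k, row in enumerate(comp):
        print(name, k, " ".join(vocab[row.argsort()[-5:][::-1]]))
```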