Here is a nice paper that addresses some of the 'systemic' shortcomings of the Multinomial Naive Bayes (MNB) classifier. The idea is that you can boost the performance of MNB through some tweaks. And they do mention using (uniform) Dirichlet priors.
Overall, if you're interested in MNB and you haven't read this paper yet, I would strongly recommend doing so.
I also found an accompanying MSc thesis by the same author(s), but I haven't read it myself yet. You can check it out.
I found this truly excellent review that describes precisely how Hierarchical Dirichlet Processes work.
First, start by choosing a base distribution $H$. In the case of topic modeling, $H$ is a Dirichlet distribution. Each realization of $H$ should be a distribution over words for a topic, so its dimension should be equal to the size of the vocabulary $V$. In the example described in the review, the author assumes a vocabulary of 10 words and uses $H = \text{Dirichlet}(1/10,\dots,1/10)$. As usual, a realization of this distribution is a 10-dimensional vector $\theta_{k}$ of proportions.
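For instance, drawing one such realization with `numpy` looks like this (only the vocabulary size of 10 comes from the review; the seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

V = 10  # vocabulary size from the review's toy example
# Base distribution H = Dirichlet(1/V, ..., 1/V); a realization theta_k is a
# V-dimensional vector of proportions, i.e. a distribution over the vocabulary.
theta_k = rng.dirichlet(np.full(V, 1.0 / V))
print(theta_k, theta_k.sum())  # non-negative entries summing to 1
```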
After this, $H$ is used to build a Dirichlet Process $DP(\gamma, H)$, and a realization $G_{0}$ of this process is a discrete distribution with atoms at locations $\{\theta_{k}\}$, where each $\theta_{k}$ describes the distribution over words for a topic $k$. If we use $G_{0}$ as the base distribution for another Dirichlet Process $DP(\alpha_{0}, G_{0})$, it is possible to obtain a realization $G_{j}$ for every document $j$ in such a way that $G_{j}$ has the same support as $G_{0}$. Therefore, every $G_{j}$ shares the same set of $\theta_{k}$'s, although with different proportions (which are called mixing weights in the definition of a Dirichlet Process).
Finally, for every document $j$ and every word $i$, we draw a realization from $G_{j}$, which gives a particular vector $\theta_{k}$. Since this $\theta_{k}$ is a distribution over words for a given topic, we only need to sample from a multinomial distribution with $\theta_{k}$ as its parameter in order to generate words, $w_{ji} \sim \text{Multinomial}(\theta_{k})$.
I have seen that sometimes $\phi_{ji}$ is defined as $\phi_{ji}=\theta_{k}$ for every document $j$ and word $i$. Sometimes, it is easier to use a variable $z_{ji}$ that works as an index: it is sampled from the probabilities $\pi_{jk}$ of $G_{j}$ (in $G_{j} = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\theta_{k}}$) and then used to pick the atom $\theta_{z_{ji}}$. However, I think this is done in the context of the stick-breaking construction, as in the sketch below.
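Putting the steps above together, here is a minimal sketch of the generative process using a truncated stick-breaking construction in `numpy`. The truncation level $K$, the concentration parameters, and the document/word counts are arbitrary illustrative choices, and the document-level weights are drawn from the finite approximation $\pi_{j} \sim \text{Dirichlet}(\alpha_{0}\beta)$ over the truncated atoms rather than from a full $DP(\alpha_{0}, G_{0})$:

```python
import numpy as np

rng = np.random.default_rng(0)

V = 10                  # vocabulary size (toy example from the review)
K = 20                  # truncation level for the number of topics
gamma, alpha0 = 1.0, 1.0
n_docs, n_words = 5, 50

# Atoms theta_k ~ H = Dirichlet(1/V, ..., 1/V): one word distribution per topic.
theta = rng.dirichlet(np.full(V, 1.0 / V), size=K)          # shape (K, V)

# Global stick-breaking weights beta ~ GEM(gamma) of G_0 = sum_k beta_k * delta_{theta_k}.
v = rng.beta(1.0, gamma, size=K)
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
beta /= beta.sum()      # renormalise because of the truncation at K

docs = []
for j in range(n_docs):
    # Document-level weights pi_j over the *same* atoms theta_k
    # (finite Dirichlet approximation to pi_j ~ DP(alpha0, beta)).
    pi_j = rng.dirichlet(alpha0 * beta)
    # For each word, pick a topic index z_ji, then draw the word from theta[z_ji].
    z = rng.choice(K, size=n_words, p=pi_j)
    words = np.array([rng.choice(V, p=theta[k]) for k in z])
    docs.append(words)
```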
Best Answer
Unfortunately, `scipy.stats` doesn't provide the logistic-normal distribution. However, you can draw random samples from a multivariate normal distribution (e.g. using `numpy`) and transform them with a logistic transformation to simulate samples drawn from the logistic-normal distribution.

Let's assume your probability vectors are $D = 3$ dimensional.
Now (as you can read here), you can transform your normally distributed sample $\mathbf{y} \in \mathbb{R}^{D-1}$ into a logistic-normally distributed sample $\mathbf{x} \in \mathcal{S}^{D}$. The distribution is defined through the log-ratio transform
$$ \mathbf{y} = \left[ \log \left( \frac{ x_1 }{ x_D } \right) , \dots , \log \left( \frac{ x_{D-1} }{ x_D } \right) \right] $$
and its inverse, the logistic transformation you apply to the normal sample, is
$$ \mathbf{x} = \left[ \frac{ e^{ y_1 } }{ 1 + \sum_{i=1}^{D-1} e^{ y_i } } , \dots , \frac{ e^{ y_{D-1} } }{ 1 + \sum_{i=1}^{D-1} e^{ y_i } } , \frac{ 1 }{ 1 + \sum_{i=1}^{D-1} e^{ y_i } } \right] $$
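Here is a minimal sketch in `numpy` (the zero mean and identity covariance are just placeholder parameters; appending an implicit zero component and normalising is exactly the logistic transformation above):

```python
import numpy as np

rng = np.random.default_rng(42)

D = 3                     # dimension of the probability vectors
mu = np.zeros(D - 1)      # mean of the underlying normal (placeholder)
cov = np.eye(D - 1)       # covariance of the underlying normal (placeholder)

# Draw y ~ N(mu, cov) in R^{D-1}.
y = rng.multivariate_normal(mu, cov, size=1000)

# Logistic transformation: append the implicit last component y_D = 0
# and normalise, which reproduces the formula above.
y_ext = np.hstack([y, np.zeros((len(y), 1))])
x = np.exp(y_ext) / np.exp(y_ext).sum(axis=1, keepdims=True)

# Each row of x lies on the simplex: non-negative entries summing to 1.
assert np.allclose(x.sum(axis=1), 1.0)
```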