In the second paper you linked to:
Naive Bayes represents a distribution as a mixture of components, where within each component all variables are assumed independent of each other. Given enough components, it can approximate an arbitrary distribution arbitrarily closely. ... When learned from data, naive Bayes models never contain more components than there are examples in the data ...
and then later, the quote you mentioned,
Naive Bayes models can be viewed as Bayesian networks in which each $X_i$ has $C$ as the sole parent and $C$ has no parents. A naive Bayes model with Gaussian $P(X_i |C )$ is equivalent to a mixture of Gaussians with diagonal covariance matrices (Dempster et al., 1977).
Then they go on to show that, in the discrete case, $P(X=x)=\sum_{c=1}^k\left(P(c)\prod_{i=1}^{|X|}P(x_i|c)\right)$.
So consider the case where, within each component $c$, $X \mid c \sim N(\mu_c,\Sigma_c^\text{diag})$. It's easy to see that:
$$
P(X=x)=\sum_{c=1}^kP(c)\prod_{i=1}^{|X|}P(x_i|c)=\sum_{c=1}^kP(c)\cdot P(x_1,\dots,x_{|X|}|c)
$$
since, conditional on $c$, the joint density of $X$ is just the product of the densities of its (conditionally independent) coordinates. This, of course, is a mixture of $k$ Gaussians, each with weight $P(c)$. So unless they're talking about something else, I think I was right in the comments. What makes me less than 100% confident is that a) they talk about "enough components", which makes me think they mean more than $k$, and b) I have no idea where in Dempster et al. (1977) this is or why they cite that paper of all things. I tried sifting through Dempster but it's on a completely different subject (the EM algorithm for maximum-likelihood estimation with missing data), and I don't think the term "naive Bayes" was even in use back then. APA really needs to start requiring page numbers somehow.
I think the reason they point this out is that they develop Naive Bayes from a graphical / network perspective, so this provides a different intuition.
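For what it's worth, the identity is easy to check numerically. Here's a minimal sketch (my own toy numbers, nothing from the paper) comparing the naive-Bayes-style product of univariate Gaussians with the corresponding mixture of diagonal-covariance Gaussians:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
k, d = 3, 4                                   # components, dimension of X
weights = rng.dirichlet(np.ones(k))           # P(c)
means = rng.normal(size=(k, d))               # mu_{c,i}
stds = rng.uniform(0.5, 2.0, size=(k, d))     # sigma_{c,i}
x = rng.normal(size=d)                        # an arbitrary test point

# naive Bayes view: sum_c P(c) * prod_i P(x_i | c), univariate Gaussians
nb = sum(weights[c] * np.prod(norm.pdf(x, means[c], stds[c])) for c in range(k))

# mixture view: sum_c P(c) * N(x; mu_c, diag(sigma_c^2))
mix = sum(weights[c] * multivariate_normal.pdf(x, means[c], np.diag(stds[c] ** 2))
          for c in range(k))

print(nb, mix)  # equal up to floating-point error
```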
- which other clustering methods (unsupervised classification) can I try for this problem?
For instance, parametric ones: you can fit a Gaussian mixture model by expectation-maximisation (EM) or by variational Bayesian (VB) inference; you try different numbers of clusters and select the model that best fits your data. Be careful: model selection is not the same for a non-Bayesian method (such as EM), where you have a point estimate of the clustering, as for a Bayesian method such as VB, where you have a full posterior distribution.
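As a rough, non-definitive illustration with scikit-learn (the synthetic `X` below is just a stand-in for your own data): `GaussianMixture` fits by EM, so you compare one point estimate per candidate size with an information criterion such as BIC, while `BayesianGaussianMixture` is the variational alternative that can prune superfluous components on its own:

```python
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

rng = np.random.default_rng(0)
# X stands in for your own (n_samples, n_features) data
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])

# EM: fit several candidate sizes and keep the one with the lowest BIC
models = [GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
          for k in range(1, 7)]
best = min(models, key=lambda m: m.bic(X))
print("EM / BIC choice:", best.n_components)

# VB: start from a generous upper bound; the weights of superfluous
# components are pushed towards zero by the variational posterior
vb = BayesianGaussianMixture(n_components=10, weight_concentration_prior=0.01,
                             max_iter=500, random_state=0).fit(X)
print("VB weights:", np.round(vb.weights_, 3))
```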
- In DPMM (Dirichlet process mixture model), it is assumed that each
mixture component is Gaussian. Can non-Gaussian distribution be used
for mixture components?
The generative model of a simple (finite) GMM is:
\begin{align*}
x_i | z_i, \theta_{z_i} &\sim F(\theta_{z_i})\\
\theta_j &\sim G_0\\
z_i &\sim \text{Discrete}(\boldsymbol{\pi})\\
\boldsymbol{\pi} &\sim \text{Dirichlet}(\boldsymbol{\alpha})
\end{align*}
where $F$ is a Normal distribution and $\boldsymbol{\pi}$ has length $k$. The collapsed version is obtained by integrating out $\boldsymbol{\pi}$. The Dirichlet process / Chinese restaurant process arises naturally when you take the limit $k \rightarrow \infty$.
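If it helps the intuition, here is a toy sketch (my own code, not from any of the papers above) of sampling cluster assignments directly from the Chinese restaurant process that this limit yields; the concentration parameter `alpha` is just an illustrative value:

```python
import numpy as np

def crp_assignments(n, alpha, seed=0):
    """Sample cluster assignments z_1..z_n from a CRP with concentration alpha."""
    rng = np.random.default_rng(seed)
    counts = []          # customers currently sitting at each table
    z = []
    for _ in range(n):
        # existing tables are chosen in proportion to their size,
        # a new table in proportion to alpha
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)          # open a new table (new cluster)
        else:
            counts[table] += 1
        z.append(table)
    return np.array(z)

print(crp_assignments(20, alpha=1.0))  # the number of clusters grows ~ alpha*log(n)
```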
You can replace the likelihood $F$ by whatever distribution you like and you will get a mixture of Bernoullis, a mixture of Poissons, a mixture of...
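To make the "swap the likelihood" point concrete, here is a small sketch of a two-component mixture of Poissons, fitted with plain EM rather than collapsed Gibbs just to keep it short; only the component density and its M-step differ from the Gaussian case:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
# toy count data from two Poisson regimes
x = np.concatenate([rng.poisson(2, 300), rng.poisson(9, 300)])

k = 2
lam = np.array([1.0, 10.0])        # component rates (the analogue of mu, Sigma)
pi = np.full(k, 1.0 / k)           # mixing weights

for _ in range(200):
    # E-step: responsibilities r[i, c] proportional to pi_c * Poisson(x_i; lam_c)
    r = pi * poisson.pmf(x[:, None], lam)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: weighted MLEs, same structure as a GMM but with Poisson formulas
    pi = r.mean(axis=0)
    lam = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

print(np.round(pi, 3), np.round(lam, 3))
```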
3.a. In collapsed Gibbs sampling, the number of iterations for the algorithm convergence is assumed fixed.
Not at all! Why should it be? Gibbs sampling, and in general any MCMC method, needs some burn-in samples until your Markov chain arrives at the stationary distribution (the posterior distribution you are looking for). Once you are there, you keep sampling from your posterior until you are happy. But it is not possible to know a priori how many iterations you need to reach the stationary distribution. Actually, even a posteriori, when examining your chains of samples (a.k.a. traces), there are convergence tests, but you never know for sure.
3.b. Is it possible the number of iterations is adaptive depend on the data and the
number of components?
So, as for adaptive techniques: you can periodically test whether your chains (one for every variable you sample) have converged, and even stop sampling if the test gives a positive result. But it might be a false positive. On top of that, you might have to check other things such as autocorrelation (to decide the thinning).
Usually, the more variables you have (and more components also means more variables), the longer the Gibbs sampler will take to converge, but there is no general formula for that.
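For reference, a common check is the Gelman–Rubin statistic computed from several independent chains, together with the autocorrelation of each chain to guide the thinning. Below is a bare-bones sketch of both for plain NumPy arrays of samples; in practice you would rather use a library such as arviz, which implements these diagnostics properly:

```python
import numpy as np

def gelman_rubin(chains):
    """chains: (m_chains, n_samples) array of draws for ONE scalar parameter."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)                 # values near 1 suggest convergence

def autocorr(chain, lag):
    """Lag-k autocorrelation of one chain, as a rough guide for thinning."""
    c = chain - chain.mean()
    return np.dot(c[:-lag], c[lag:]) / np.dot(c, c)

# toy example: 4 chains that already share the same stationary distribution
rng = np.random.default_rng(0)
chains = rng.normal(size=(4, 2000))
print(gelman_rubin(chains))          # close to 1.0
print(autocorr(chains[0], lag=1))    # close to 0 for near-independent draws
```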
Best Answer
After fitting the first Gaussian component, you have its mean and covariance. Splitting it just means creating a new model with two components instead of one. I'd suggest the new components keep the shape of the first but have their means shifted by half a standard deviation in opposite directions, i.e. $\mu_1 = \mu_0 + 0.5\,\sigma_0$ and $\mu_2 = \mu_0 - 0.5\,\sigma_0$. Intuitively, this shares the data roughly equally between the two new components. Then re-optimise the two-component model and repeat. At each stage, you could take the component with the largest covariance and replace it with two components.
[EDIT] The exact method used to initialise the new components shouldn't be critical, as the parameters will be optimised further anyway. See also Ueda et al. (2000), for example. One option they suggest is to define the new means by random perturbations of the old component's mean, and to give the new components unit covariance.
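Here's a rough sketch of that split-and-refit loop with scikit-learn's `GaussianMixture`; the $\pm 0.5\,\sigma$ initialisation is the heuristic described above, and the stopping rule (`max_components`) is just a placeholder you would replace with your own criterion:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def split_largest_component(gmm, X, max_components=5):
    """Greedily split the component with the largest total variance and refit."""
    while gmm.n_components < max_components:
        k = gmm.n_components
        j = int(np.argmax([np.trace(c) for c in gmm.covariances_]))
        sigma = np.sqrt(np.diag(gmm.covariances_[j]))      # per-dimension std dev
        mu1 = gmm.means_[j] + 0.5 * sigma                  # mu_0 + 0.5*sigma_0
        mu2 = gmm.means_[j] - 0.5 * sigma                  # mu_0 - 0.5*sigma_0
        means_init = np.vstack([np.delete(gmm.means_, j, axis=0), mu1, mu2])
        w = np.delete(gmm.weights_, j)
        weights_init = np.concatenate([w, [gmm.weights_[j] / 2] * 2])
        gmm = GaussianMixture(n_components=k + 1, covariance_type="full",
                              means_init=means_init,
                              weights_init=weights_init).fit(X)   # re-optimise
    return gmm

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
gmm = GaussianMixture(n_components=1).fit(X)    # start with a single component
gmm = split_largest_component(gmm, X, max_components=3)
print(gmm.means_)
```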