Solved – choosing prior parameters for variational mixture of Gaussians

bayesian, gaussian-mixture-distribution, machine-learning, prior, variational-bayes

I am implementing a vanilla variational mixture of multivariate Gaussians, as per Chapter 10 of Pattern Recognition and Machine Learning (Bishop, 2007).

The Bayesian approach requires specifying hyperparameters for the Dirichlet prior on the mixing coefficients and the Gaussian-Wishart prior on each component's mean and precision:

  • $\alpha_0$ (concentration parameter of the Dirichlet prior);
  • $\nu_0$ (degrees of freedom of the Wishart distribution);
  • $\beta_0$ (number of pseudo-observations for the Gaussian prior on the mean);
  • $\mathbf{m}_0$ (mean of the Gaussian distribution);
  • $\mathbf{W}_0$ (scale matrix of the Wishart distribution).

Common choices are $\alpha_0 = 1$, $\nu_0 = d + 1$, $\beta_0 = 1$, $\textbf{m}_0 = \textbf{0}$, $\textbf{W}_0 = \textbf{I}_d$, where $d$ is the dimensionality of the space.

Unsurprisingly, the posterior can depend strongly on the choice of parameters (in particular, I find that $\textbf{W}_0$ has a large impact on the number of components, much more than $\alpha_0$). For $\textbf{m}_0$ and $\textbf{W}_0$, the choices above make sense only if the data have been somewhat normalized.
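To make the generic defaults above usable, one option is to standardize the data first so that $\mathbf{m}_0 = \mathbf{0}$ and $\mathbf{W}_0 = \mathbf{I}_d$ are on the right scale. A minimal sketch (plain NumPy; the data matrix `X` below is just a placeholder):

```python
import numpy as np

# X: (n, d) data matrix; placeholder data for illustration only.
X = np.random.default_rng(0).normal(size=(500, 2))

# Standardize each dimension so that m_0 = 0 and W_0 = I_d are sensible.
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# The common default hyperparameters from the question.
d = X.shape[1]
alpha0 = 1.0          # Dirichlet concentration
beta0 = 1.0           # pseudo-observations for the mean
nu0 = d + 1.0         # Wishart degrees of freedom
m0 = np.zeros(d)      # prior mean (data are centred)
W0 = np.eye(d)        # Wishart scale matrix (unit variance per dimension)
```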

Following a sort-of empirical Bayes approach, I was thinking of setting $\mathbf{m}_0$ and $\mathbf{W}_0^{-1}$ equal to the empirical mean and empirical covariance matrix of the data (for the latter, I could perhaps keep only the diagonal; also, I need to multiply the sample covariance matrix by $\nu_0$). Would this be sensible? Any suggestion on other reasonable methods to set the parameters (without going fully hierarchical Bayes or switching to a DPGMM)?
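For concreteness, here is a minimal sketch of that empirical-Bayes choice (plain NumPy; `empirical_bayes_priors` is my own helper name, and it assumes the Wishart is on the precision with $\mathbb{E}[\boldsymbol{\Lambda}] = \nu_0 \mathbf{W}_0$, so that $\mathbf{W}_0^{-1} = \nu_0 \hat{\boldsymbol{\Sigma}}$ makes the implied prior component covariance roughly the sample covariance):

```python
import numpy as np

def empirical_bayes_priors(X, use_diagonal=True):
    """Set m_0 and W_0 from the data (sort-of empirical Bayes).

    Assumes a Wishart prior on the precision with E[Lambda] = nu_0 * W_0,
    so W_0^{-1} = nu_0 * Sigma_hat puts the prior mean of the component
    covariance roughly at the sample covariance.
    """
    n, d = X.shape
    alpha0 = 1.0
    beta0 = 1.0
    nu0 = d + 1.0
    m0 = X.mean(axis=0)

    sigma_hat = np.cov(X, rowvar=False)
    if use_diagonal:
        sigma_hat = np.diag(np.diag(sigma_hat))   # keep only the diagonal
    W0 = np.linalg.inv(nu0 * sigma_hat)           # W_0^{-1} = nu_0 * Sigma_hat
    return alpha0, beta0, nu0, m0, W0
```

If you are not writing the variational updates yourself, scikit-learn's `BayesianGaussianMixture` exposes the analogous hyperparameters (`weight_concentration_prior`, `mean_precision_prior`, `mean_prior`, `degrees_of_freedom_prior`, `covariance_prior`); note that it parameterizes the prior through a covariance rather than a Wishart scale matrix, so check its documentation for the exact correspondence.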

(There is a similar question here, but no answer that is relevant to my question.)

Best Answer

Good priors depend on your actual problem; in particular, I don't believe there are any truly universal defaults. One good approach is to try to formulate (possibly weak and vague) domain-specific knowledge about the process that generated your data, e.g.:

  • "It's highly unlikely to have more than 12 components"
  • "It's highly unlikely to observe values larger than 80"

Note that those statements should generally be informed not by the actual data you collected but by what you could say before gathering the data (e.g. the data represent outdoor temperatures in Celsius, therefore they will very likely lie in $[-50, 80]$ even before looking at the data). It is also OK to motivate your priors by the computational machinery you use (e.g. "I will collect 100 data points, hence I can safely assume it is unlikely to have more than 10 components, since I won't have enough data to locate more components anyway").

Some of those statements can be translated directly into priors, e.g. you can set $\mathbf{m}_0$ and $\mathbf{W}_0^{-1}$ so that 95% of the prior mass lies over the expected range of values.
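As a rough sketch of that idea (my own heuristic, not from the question or the paper; it assumes the Wishart is on the precision with $\mathbb{E}[\boldsymbol{\Lambda}] = \nu_0 \mathbf{W}_0$): centre $\mathbf{m}_0$ on the plausible range and choose $\mathbf{W}_0$ so that the implied prior standard deviation per dimension puts roughly 95% of the mass inside the range (about two standard deviations).

```python
import numpy as np

def priors_from_range(lo, hi, nu0=None):
    """Hypothetical helper: derive m_0 and W_0 from an expected value range.

    lo, hi: length-d arrays of plausible lower/upper bounds per dimension
    (e.g. [-50, 80] for outdoor temperatures in Celsius).
    """
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    d = lo.size
    if nu0 is None:
        nu0 = d + 1.0

    m0 = (lo + hi) / 2.0              # centre of the plausible range
    sd = (hi - lo) / 4.0              # ~95% of a Gaussian within +/- 2 sd
    sigma0 = np.diag(sd ** 2)         # target prior covariance
    W0 = np.linalg.inv(nu0 * sigma0)  # so that E[Lambda]^{-1} = sigma0
    return m0, W0

# e.g. one-dimensional outdoor temperature in Celsius
m0, W0 = priors_from_range([-50.0], [80.0])
```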

For the less intuitive parameters (or just as an additional robustness check), you can follow the Visualization in Bayesian workflow paper and do prior predictive checks: simulate a large number of new datasets from your prior, then visualize them to see whether they (a small simulation sketch follows the list below)

  1. don't violate your expectations too often (it is good to leave some room for surprises, so aim for something like 90% or 95% of simulations satisfying your constraints);
  2. otherwise cover the whole spectrum of plausible values reasonably well.
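Here is one way such a simulation could look (a sketch only, using NumPy/SciPy; the generative structure follows the Gaussian-Wishart prior from the question, and the Celsius range and the specific values of $\mathbf{m}_0$ and $\mathbf{W}_0$ are just the example above):

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)

def simulate_prior_dataset(alpha0, beta0, nu0, m0, W0, K=10, n=100):
    """Draw one synthetic dataset from the prior of the variational GMM."""
    d = len(m0)
    pi = rng.dirichlet(alpha0 * np.ones(K))                  # mixing weights
    Lambdas = wishart.rvs(df=nu0, scale=W0, size=K,
                          random_state=rng).reshape(K, d, d)  # precisions
    mus = np.stack([
        rng.multivariate_normal(m0, np.linalg.inv(beta0 * L)) for L in Lambdas
    ])                                                        # component means
    z = rng.choice(K, size=n, p=pi)                           # component labels
    return np.stack([
        rng.multivariate_normal(mus[k], np.linalg.inv(Lambdas[k])) for k in z
    ])

# Example: m_0 = 15 C, prior sd ~ 32.5 C, so E[Lambda] = 1 / 32.5**2.
m0 = np.array([15.0])
W0 = np.array([[1.0 / (2 * 32.5 ** 2)]])   # nu_0 = 2 for d = 1
frac_inside = [
    np.mean((X > -50) & (X < 80))
    for X in (simulate_prior_dataset(1.0, 1.0, 2.0, m0, W0) for _ in range(200))
]
print("average fraction of simulated values inside [-50, 80]:",
      np.mean(frac_inside))
```

Plotting a handful of the simulated datasets (histograms or scatter plots) is usually more informative than a single summary number like the one printed here.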